Video Question Answering with Iterative Video-Text Co-tokenization

  • Aj Piergiovanni
  • Kairo Morton
  • Weicheng Kuo
  • Michael S. Ryoo
  • Anelia Angelova
  • Alphabet Inc.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

14 Scopus citations

Abstract

Video question answering is a challenging task that requires jointly understanding the language input, the visual information in individual video frames, and the temporal information about the events occurring in the video. In this paper, we propose a novel multi-stream video encoder for video question answering that uses multiple video inputs and a new video-text iterative co-tokenization approach to answer a variety of questions related to videos. We experimentally evaluate the model on several datasets, such as MSRVTT-QA, MSVD-QA, and IVQA, outperforming the previous state-of-the-art by large margins. Simultaneously, our model reduces the required GFLOPs from 150–360 to only 67, producing a highly efficient video question answering model (Code: https://sites.google.com/view/videoqa-cotokenization).

Original language: English
Title of host publication: Computer Vision – ECCV 2022 - 17th European Conference, Proceedings
Editors: Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, Tal Hassner
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 76-94
Number of pages: 19
ISBN (Print): 9783031200588
State: Published - 2022
Event: 17th European Conference on Computer Vision, ECCV 2022 - Tel Aviv, Israel
Duration: Oct 23 2022 – Oct 27 2022

Publication series

Name: Lecture Notes in Computer Science
Volume: 13696 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 17th European Conference on Computer Vision, ECCV 2022
Country/Territory: Israel
City: Tel Aviv
Period: 10/23/22 – 10/27/22

Keywords

  • Efficient vision models
  • Video question answering
  • Video understanding
  • Video-text joint learning
