TY - GEN
T1 - Video Question Answering with Iterative Video-Text Co-tokenization
AU - Piergiovanni, AJ
AU - Morton, Kairo
AU - Kuo, Weicheng
AU - Ryoo, Michael S.
AU - Angelova, Anelia
N1 - Publisher Copyright: © 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2022
Y1 - 2022
N2 - Video question answering is a challenging task that requires understanding jointly the language input, the visual information in individual video frames, as well as the temporal information about the events occurring in the video. In this paper, we propose a novel multi-stream video encoder for video question answering that uses multiple video inputs and a new video-text iterative co-tokenization approach to answer a variety of questions related to videos. We experimentally evaluate the model on several datasets, such as MSRVTT-QA, MSVD-QA, IVQA, outperforming the previous state-of-the-art by large margins. Simultaneously, our model reduces the required GFLOPs from 150–360 to only 67, producing a highly efficient video question answering model (Code: https://sites.google.com/view/videoqa-cotokenization).
AB - Video question answering is a challenging task that requires understanding jointly the language input, the visual information in individual video frames, as well as the temporal information about the events occurring in the video. In this paper, we propose a novel multi-stream video encoder for video question answering that uses multiple video inputs and a new video-text iterative co-tokenization approach to answer a variety of questions related to videos. We experimentally evaluate the model on several datasets, such as MSRVTT-QA, MSVD-QA, IVQA, outperforming the previous state-of-the-art by large margins. Simultaneously, our model reduces the required GFLOPs from 150–360 to only 67, producing a highly efficient video question answering model (Code: https://sites.google.com/view/videoqa-cotokenization).
KW - Efficient vision models
KW - Video question answering
KW - Video understanding
KW - Video-text joint learning
UR - https://www.scopus.com/pages/publications/85142758572
U2 - 10.1007/978-3-031-20059-5_5
DO - 10.1007/978-3-031-20059-5_5
M3 - Conference contribution
AN - SCOPUS:85142758572
SN - 9783031200588
T3 - Lecture Notes in Computer Science
SP - 76
EP - 94
BT - Computer Vision – ECCV 2022 - 17th European Conference, Proceedings
A2 - Avidan, Shai
A2 - Brostow, Gabriel
A2 - Cissé, Moustapha
A2 - Farinella, Giovanni Maria
A2 - Hassner, Tal
PB - Springer Science and Business Media Deutschland GmbH
T2 - 17th European Conference on Computer Vision, ECCV 2022
Y2 - 23 October 2022 through 27 October 2022
ER -