TokenLearner: Adaptive Space-Time Tokenization for Videos

  • Michael S. Ryoo
  • A. J. Piergiovanni
  • Anurag Arnab
  • Mostafa Dehghani
  • Anelia Angelova
  • Alphabet Inc.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

139 Scopus citations

Abstract

In this paper, we introduce a novel visual representation learning approach that relies on a handful of adaptively learned tokens and is applicable to both image and video understanding tasks. Instead of relying on hand-designed splitting strategies to obtain visual tokens and processing a large number of densely sampled patches for attention, our approach learns to mine important tokens in visual data. This makes it possible to find a few important visual tokens efficiently and effectively, and enables modeling of pairwise attention between such tokens over a longer temporal horizon for videos, or over the spatial content in image frames. Our experiments demonstrate strong performance on several challenging benchmarks for video recognition tasks. Importantly, because our tokens are adaptive, we achieve competitive results at significantly reduced computational cost. We establish new state-of-the-art results on multiple video datasets, including Kinetics-400, Kinetics-600, Charades, and AViD.
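
The abstract does not spell out the mechanism, so the following is only an illustrative sketch of the general idea of pooling many patch features into a few input-dependent tokens. The function name `learn_tokens`, the linear scoring vectors in `weights`, and the softmax normalization over patches are all assumptions made for illustration; they are not taken from the paper and should not be read as its architecture.

```python
import math
import random


def learn_tokens(features, weights):
    """Pool N patch features into S tokens using input-dependent weights.

    features: list of N patch vectors, each of length C.
    weights:  list of S scoring vectors of length C
              (hypothetical stand-ins for learned parameters).
    Returns a list of S tokens, each of length C.
    """
    channels = len(features[0])
    tokens = []
    for w in weights:
        # One score per patch: how relevant that patch is to this token.
        scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in features]
        # Softmax over patches (an assumption), shifted by the max for stability.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        attn = [e / total for e in exps]
        # Token = attention-weighted average of the patch features.
        tokens.append(
            [sum(a * x[c] for a, x in zip(attn, features)) for c in range(channels)]
        )
    return tokens


random.seed(0)
num_patches, channels, num_tokens = 196, 16, 8
features = [[random.gauss(0, 1) for _ in range(channels)] for _ in range(num_patches)]
weights = [[random.gauss(0, 1) for _ in range(channels)] for _ in range(num_tokens)]
tokens = learn_tokens(features, weights)

print(len(tokens), len(tokens[0]))        # 8 16
# Pairwise attention cost grows with the square of the token count.
print(num_patches ** 2, num_tokens ** 2)  # 38416 64
```

With these hypothetical sizes, pairwise attention over 8 tokens involves 64 token pairs instead of 38416 for the 196 patches, which illustrates why a small set of adaptive tokens can reduce computational cost.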

Original language: English
Title of host publication: Advances in Neural Information Processing Systems 34 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021
Editors: Marc'Aurelio Ranzato, Alina Beygelzimer, Yann Dauphin, Percy S. Liang, Jenn Wortman Vaughan
Publisher: Neural information processing systems foundation
Pages: 12786-12797
Number of pages: 12
ISBN (Electronic): 9781713845393
State: Published - 2021
Event: 35th Conference on Neural Information Processing Systems, NeurIPS 2021 - Virtual, Online
Duration: Dec 6 2021 → Dec 14 2021

Publication series

Name: Advances in Neural Information Processing Systems
Volume: 16
ISSN (Print): 1049-5258

Conference

Conference: 35th Conference on Neural Information Processing Systems, NeurIPS 2021
City: Virtual, Online
Period: 12/6/21 → 12/14/21
