
Mirasol3B: A Multimodal Autoregressive Model for Time-Aligned and Contextual Modalities

  • A. J. Piergiovanni
  • Isaac Noble
  • Dahun Kim
  • Michael S. Ryoo
  • Victor Gomes
  • Anelia Angelova
  • Alphabet Inc.

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

13 Scopus citations

Abstract

One of the main challenges of multimodal learning is combining multiple heterogeneous modalities, e.g., video, audio, and text. Video and audio are obtained at much higher rates than text and are roughly aligned in time. They are often not synchronized with text, which comes as a global context, e.g., a title or a description. Furthermore, video and audio inputs are of much larger volume and grow as the video length increases, which naturally requires more compute dedicated to these modalities and makes modeling of long-range dependencies harder. Here, we decouple the multimodal modeling, dividing it into separate autoregressive models that process the inputs according to the characteristics of the modalities. We propose a multimodal model consisting of an autoregressive component for the time-synchronized modalities (audio and video) and an autoregressive component for the context modalities, which are not necessarily aligned in time but are still sequential. To address the long sequences of the video-audio inputs, we further partition the video and audio sequences into consecutive snippets and autoregressively process their representations. To that end, we propose a Combiner mechanism, which models the audio-video information jointly, producing compact but expressive representations. This allows us to scale to 512 input video frames without an increase in model parameters. Our approach achieves state-of-the-art results on multiple well-established multimodal benchmarks. It effectively addresses the high computational demand of media inputs by learning compact representations, controlling the sequence length of the audio-video feature representations, and modeling their dependencies in time.
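The sequence-length control described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how the Combiner is built, so `toy_combiner` below is a hypothetical stand-in that uses plain average pooling, and the names `partition`, `compact_sequence`, `num_snippets`, and `m` are assumptions introduced only for illustration. The sketch shows one point: if each snippet is reduced to a fixed number of features, the sequence handed to the autoregressive component has the same length no matter how many frames come in.

```python
from typing import List

Vector = List[float]


def partition(seq: List[Vector], num_parts: int) -> List[List[Vector]]:
    """Split a time-ordered sequence into consecutive, near-equal parts."""
    if num_parts <= 0 or num_parts > len(seq):
        raise ValueError("num_parts must be between 1 and len(seq)")
    base, extra = divmod(len(seq), num_parts)
    parts, start = [], 0
    for i in range(num_parts):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        parts.append(seq[start:start + size])
        start += size
    return parts


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise average of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def toy_combiner(video: List[Vector], audio: List[Vector], m: int) -> List[Vector]:
    """Hypothetical stand-in for a learned combiner.

    Pools the snippet's video and audio features into exactly m vectors.
    Average pooling is an assumption made here, not the paper's mechanism.
    """
    joint = video + audio  # both modalities feed one compact representation
    return [mean(group) for group in partition(joint, m)]


def compact_sequence(video: List[Vector], audio: List[Vector],
                     num_snippets: int, m: int) -> List[Vector]:
    """Return num_snippets * m features, ordered in time by snippet."""
    out: List[Vector] = []
    for v, a in zip(partition(video, num_snippets), partition(audio, num_snippets)):
        out.extend(toy_combiner(v, a, m))
    return out


if __name__ == "__main__":
    for frames in (128, 512):
        video = [[float(t)] * 4 for t in range(frames)]
        audio = [[float(-t)] * 4 for t in range(frames)]
        print(frames, len(compact_sequence(video, audio, num_snippets=8, m=4)))
```

With 8 snippets and 4 features per snippet, both the 128-frame and the 512-frame inputs yield 32 features. A downstream autoregressive model would consume those features in snippet order, so its input length is set by `num_snippets * m` rather than by the raw frame count.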

Original language: English
Title of host publication: Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Publisher: IEEE Computer Society
Pages: 26794-26804
Number of pages: 11
ISBN (Electronic): 9798350353006
State: Published - 2024
Event: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024 - Seattle, United States
Duration: Jun 16 2024 to Jun 22 2024

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
ISSN (Print): 1063-6919

Conference

Conference: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Country/Territory: United States
City: Seattle
Period: 06/16/24 to 06/22/24
