
xGen-VideoSyn-1: High-Fidelity Text-to-Video Synthesis with Compressed Representations

  • Can Qin
  • Congying Xia
  • Krithika Ramakrishnan
  • Michael Ryoo
  • Lifu Tu
  • Yihao Feng
  • Manli Shu
  • Honglu Zhou
  • Anas Awadalla
  • Jun Wang
  • Senthil Purushwalkam
  • Le Xue
  • Yingbo Zhou
  • Huan Wang
  • Silvio Savarese
  • Juan Carlos Niebles
  • Zeyuan Chen
  • Ran Xu
  • Caiming Xiong
  • Salesforce.com, Inc.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions. We extend the latent diffusion model (LDM) architecture by introducing a video variational autoencoder (VidVAE). Our Video VAE compresses video data spatially and temporally, significantly reducing the length of visual tokens and the computational demands associated with generating long-sequence videos. To further address the computational cost, we propose a divide-and-merge strategy that maintains temporal consistency across video segments. Our Diffusion Transformer (DiT) model incorporates spatial and temporal self-attention layers, enabling robust generalization across different time frames and aspect ratios. We also designed a data collection and processing pipeline, which helped us gather over 13 million high-quality video-text pairs. The pipeline includes steps such as clipping, text detection, motion estimation, aesthetics scoring, and dense captioning based on our xGen-MM video-language model. Training the Video VAE and DiT models required approximately 40 and 642 H100 days, respectively. Our model supports over 14-second 720p video generation in an end-to-end way and demonstrates competitive performance against state-of-the-art T2V models.
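To make the token-length claim concrete, the following is a minimal sketch of how spatial and temporal compression shrinks the number of visual tokens a transformer must attend over. The abstract does not state compression factors, frame rate, or patch size, so the values used here (4x temporal, 8x spatial, 24 fps, 2x2 patches) and the function name `latent_token_count` are illustrative assumptions, not the authors' reported configuration.

```python
import math


def latent_token_count(frames, height, width, t_factor, s_factor, patch=1):
    """Count tokens after compressing a video and patchifying the result.

    t_factor and s_factor are hypothetical temporal and spatial
    compression ratios; patch is a hypothetical spatial patch size
    applied to the compressed grid.
    """
    latent_frames = math.ceil(frames / t_factor)
    latent_h = math.ceil(height / (s_factor * patch))
    latent_w = math.ceil(width / (s_factor * patch))
    return latent_frames * latent_h * latent_w


# Assumed clip: 14 seconds at 24 fps, 720p (1280x720). The frame rate is
# an assumption; the abstract only gives the duration and resolution.
frames = 14 * 24
height, width = 720, 1280

# Token count with no compression (factors of 1) versus assumed 4x/8x factors.
pixel_tokens = latent_token_count(frames, height, width, 1, 1, patch=2)
latent_tokens = latent_token_count(frames, height, width, 4, 8, patch=2)

reduction = pixel_tokens / latent_tokens
print(f"pixel-space tokens:  {pixel_tokens}")
print(f"latent-space tokens: {latent_tokens}")
print(f"reduction factor:    {reduction:.0f}x")
```

Under these assumed factors the sequence length falls by the product of the temporal factor and the square of the spatial factor, which is why compressing along both axes matters for long, high-resolution clips.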

Original language: English
Title of host publication: Computer Vision – ECCV 2024 Workshops, Proceedings
Editors: Alessio Del Bue, Cristian Canton, Jordi Pont-Tuset, Tatiana Tommasi
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 249-265
Number of pages: 17
ISBN (Print): 9783031928079
State: Published - 2025
Event: Workshops that were held in conjunction with the 18th European Conference on Computer Vision, ECCV 2024 - Milan, Italy
Duration: Sep 29 2024 to Oct 4 2024

Publication series

Name: Lecture Notes in Computer Science
Volume: 15627 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: Workshops that were held in conjunction with the 18th European Conference on Computer Vision, ECCV 2024
Country/Territory: Italy
City: Milan
Period: 09/29/24 to 10/4/24

Keywords

  • Diffusion
  • Text-to-Video
  • Variational Autoencoder
