TY - GEN
T1 - xGen-VideoSyn-1
T2 - Workshops that were held in conjunction with the 18th European Conference on Computer Vision, ECCV 2024
AU - Qin, Can
AU - Xia, Congying
AU - Ramakrishnan, Krithika
AU - Ryoo, Michael
AU - Tu, Lifu
AU - Feng, Yihao
AU - Shu, Manli
AU - Zhou, Honglu
AU - Awadalla, Anas
AU - Wang, Jun
AU - Purushwalkam, Senthil
AU - Xue, Le
AU - Zhou, Yingbo
AU - Wang, Huan
AU - Savarese, Silvio
AU - Niebles, Juan Carlos
AU - Chen, Zeyuan
AU - Xu, Ran
AU - Xiong, Caiming
N1 - Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
PY - 2025
Y1 - 2025
N2 - We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions. We extend the latent diffusion model (LDM) architecture by introducing a video variational autoencoder (VidVAE). Our Video VAE compresses video data spatially and temporally, significantly reducing the length of visual tokens and the computational demands associated with generating long-sequence videos. To further address the computational cost, we propose a divide-and-merge strategy that maintains temporal consistency across video segments. Our Diffusion Transformer (DiT) model incorporates spatial and temporal self-attention layers, enabling robust generalization across different time frames and aspect ratios. We also designed a data collection and processing pipeline, which helped us gather over 13 million high-quality video-text pairs. The pipeline includes steps such as clipping, text detection, motion estimation, aesthetics scoring, and dense captioning based on our xGen-MM video-language model. Training the Video VAE and DiT models required approximately 40 and 642 H100 days, respectively. Our model supports over 14-second 720p video generation in an end-to-end way and demonstrates competitive performance against state-of-the-art T2V models.
KW - Diffusion
KW - Text-to-Video
KW - Variational Autoencoder
UR - https://www.scopus.com/pages/publications/105016704360
U2 - 10.1007/978-3-031-92808-6_16
DO - 10.1007/978-3-031-92808-6_16
M3 - Conference contribution
AN - SCOPUS:105016704360
SN - 9783031928079
T3 - Lecture Notes in Computer Science
SP - 249
EP - 265
BT - Computer Vision – ECCV 2024 Workshops, Proceedings
A2 - Del Bue, Alessio
A2 - Canton, Cristian
A2 - Pont-Tuset, Jordi
A2 - Tommasi, Tatiana
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 29 September 2024 through 4 October 2024
ER -