Skip to main navigation Skip to search Skip to main content

Progressive Autoregressive Video Diffusion Models

  • Desai Xie
  • , Zhan Xu
  • , Yicong Hong
  • , Hao Tan
  • , Difan Liu
  • , Feng Liu
  • , Arie Kaufman
  • , Yang Zhou
  • Stony Brook University
  • Adobe Systems Incorporated

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

4 Scopus citations

Abstract

Current frontier video diffusion models have demonstrated remarkable results at generating high-quality videos. However, they can only generate short video clips, normally around 10 seconds or 240 frames, due to computation limitations during training. Existing methods naively achieve autoregressive long video generation by directly placing the ending of the previous clip at the front of the attention window as conditioning, which leads to abrupt scene changes, unnatural motion, and error accumulation. In this work, we introduce a more natural formulation of autoregressive long video generation by revisiting the noise level assumption in video diffusion models. Our key idea is to 1. assign the frames with per-frame, progressively increasing noise levels rather than a single noise level and 2. denoise and shift the frames in small intervals rather than all at once. This allows for smoother attention correspondence among frames with adjacent noise levels, larger overlaps between the attention windows, and better propagation of information from the earlier to the later frames. Video diffusion models equipped with our progressive noise schedule can autoregressively generate long videos with much improved fidelity compared to the baselines and minimal quality degradation over time. We present the first results on text-conditioned 60 -second (1440 frames) long video generation at a quality close to frontier models. Code and video results are available at https://desaixie.github.io/pa-vdm/.

Original languageEnglish
Title of host publicationProceedings - 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2025
PublisherIEEE Computer Society
Pages6312-6322
Number of pages11
ISBN (Electronic)9798331599942
DOIs
StatePublished - 2025
Event2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2025 - Nashville, United States
Duration: Jun 11 2025Jun 12 2025

Publication series

NameIEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
ISSN (Print)2160-7508
ISSN (Electronic)2160-7516

Conference

Conference2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2025
Country/TerritoryUnited States
CityNashville
Period06/11/2506/12/25

Keywords

  • Diffusion Models
  • Long video generation
  • Transformer

Fingerprint

Dive into the research topics of 'Progressive Autoregressive Video Diffusion Models'. Together they form a unique fingerprint.

Cite this