TY - GEN
T1 - Progressive Autoregressive Video Diffusion Models
AU - Xie, Desai
AU - Xu, Zhan
AU - Hong, Yicong
AU - Tan, Hao
AU - Liu, Difan
AU - Liu, Feng
AU - Kaufman, Arie
AU - Zhou, Yang
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Current frontier video diffusion models have demonstrated remarkable results at generating high-quality videos. However, they can only generate short video clips, normally around 10 seconds or 240 frames, due to computation limitations during training. Existing methods naively achieve autoregressive long video generation by directly placing the ending of the previous clip at the front of the attention window as conditioning, which leads to abrupt scene changes, unnatural motion, and error accumulation. In this work, we introduce a more natural formulation of autoregressive long video generation by revisiting the noise level assumption in video diffusion models. Our key idea is to 1. assign the frames with per-frame, progressively increasing noise levels rather than a single noise level and 2. denoise and shift the frames in small intervals rather than all at once. This allows for smoother attention correspondence among frames with adjacent noise levels, larger overlaps between the attention windows, and better propagation of information from the earlier to the later frames. Video diffusion models equipped with our progressive noise schedule can autoregressively generate long videos with much improved fidelity compared to the baselines and minimal quality degradation over time. We present the first results on text-conditioned 60 -second (1440 frames) long video generation at a quality close to frontier models. Code and video results are available at https://desaixie.github.io/pa-vdm/.
AB - Current frontier video diffusion models have demonstrated remarkable results at generating high-quality videos. However, they can only generate short video clips, normally around 10 seconds or 240 frames, due to computation limitations during training. Existing methods naively achieve autoregressive long video generation by directly placing the ending of the previous clip at the front of the attention window as conditioning, which leads to abrupt scene changes, unnatural motion, and error accumulation. In this work, we introduce a more natural formulation of autoregressive long video generation by revisiting the noise level assumption in video diffusion models. Our key idea is to 1. assign the frames with per-frame, progressively increasing noise levels rather than a single noise level and 2. denoise and shift the frames in small intervals rather than all at once. This allows for smoother attention correspondence among frames with adjacent noise levels, larger overlaps between the attention windows, and better propagation of information from the earlier to the later frames. Video diffusion models equipped with our progressive noise schedule can autoregressively generate long videos with much improved fidelity compared to the baselines and minimal quality degradation over time. We present the first results on text-conditioned 60 -second (1440 frames) long video generation at a quality close to frontier models. Code and video results are available at https://desaixie.github.io/pa-vdm/.
KW - Diffusion Models
KW - Long video generation
KW - Transformer
UR - https://www.scopus.com/pages/publications/105017842240
U2 - 10.1109/CVPRW67362.2025.00628
DO - 10.1109/CVPRW67362.2025.00628
M3 - Conference contribution
AN - SCOPUS:105017842240
T3 - IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
SP - 6312
EP - 6322
BT - Proceedings - 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2025
PB - IEEE Computer Society
T2 - 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2025
Y2 - 11 June 2025 through 12 June 2025
ER -