Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion
January 15, 2025
Authors: Jingyuan Chen, Fuchen Long, Jie An, Zhaofan Qiu, Ting Yao, Jiebo Luo, Tao Mei
cs.AI
Abstract
The first-in-first-out (FIFO) video diffusion, built on a pre-trained
text-to-video model, has recently emerged as an effective approach for
tuning-free long video generation. This technique maintains a queue of video
frames with progressively increasing noise, continuously producing clean frames
at the queue's head while Gaussian noise is enqueued at the tail. However,
FIFO-Diffusion often struggles to keep long-range temporal consistency in the
generated videos due to the lack of correspondence modeling across frames. In
this paper, we propose Ouroboros-Diffusion, a novel video denoising framework
designed to enhance structural and content (subject) consistency, enabling the
generation of consistent videos of arbitrary length. Specifically, we introduce
a new latent sampling technique at the queue tail to improve structural
consistency, ensuring perceptually smooth transitions among frames. To enhance
subject consistency, we devise a Subject-Aware Cross-Frame Attention (SACFA)
mechanism, which aligns subjects across frames within short segments to achieve
better visual coherence. Furthermore, we introduce self-recurrent guidance.
This technique leverages information from all previous cleaner frames at the
front of the queue to guide the denoising of noisier frames at the end,
fostering rich and contextual global information interaction. Extensive
long video generation experiments on the VBench benchmark demonstrate the
superiority of our Ouroboros-Diffusion, particularly in terms of subject
consistency, motion smoothness, and temporal consistency.
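The FIFO queue mechanics described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the latents are stand-in strings, `denoise_step` is a hypothetical placeholder for one denoising pass of a pretrained text-to-video model, and `NUM_LEVELS` is an assumed queue depth. It only demonstrates the diagonal schedule: every frame in the queue drops one noise level per pass, the fully denoised head frame is dequeued as output, and fresh Gaussian noise is enqueued at the tail.

```python
from collections import deque

NUM_LEVELS = 4  # assumed number of noise levels held in the queue

def denoise_step(latent):
    # Placeholder for one denoising step of a pretrained
    # text-to-video model; a real model would remove one
    # level of noise conditioned on the text prompt.
    return latent

def fifo_diffusion(num_output_frames):
    # Queue entries are (latent, noise_level): head is cleanest, tail noisiest.
    queue = deque((f"z{i}", i + 1) for i in range(NUM_LEVELS))
    outputs = []
    next_frame_id = NUM_LEVELS
    while len(outputs) < num_output_frames:
        # Diagonal denoising pass: each queued frame drops one noise level.
        queue = deque((denoise_step(z), level - 1) for z, level in queue)
        # The head frame has reached level 0 -> dequeue it as a clean frame.
        clean, level = queue.popleft()
        assert level == 0
        outputs.append(clean)
        # Enqueue fresh Gaussian noise at the tail (maximum noise level).
        queue.append((f"z{next_frame_id}", NUM_LEVELS))
        next_frame_id += 1
    return outputs
```

Because the queue is refilled on every iteration, the loop can run indefinitely, which is why this scheme supports tuning-free generation of videos of arbitrary length; Ouroboros-Diffusion's contributions (tail latent sampling, SACFA, self-recurrent guidance) then address the cross-frame consistency this plain loop lacks.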