Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion
January 15, 2025
Authors: Jingyuan Chen, Fuchen Long, Jie An, Zhaofan Qiu, Ting Yao, Jiebo Luo, Tao Mei
cs.AI
Abstract
The first-in-first-out (FIFO) video diffusion, built on a pre-trained
text-to-video model, has recently emerged as an effective approach for
tuning-free long video generation. This technique maintains a queue of video
frames with progressively increasing noise, continuously producing clean frames
at the queue's head while Gaussian noise is enqueued at the tail. However,
FIFO-Diffusion often struggles to keep long-range temporal consistency in the
generated videos due to the lack of correspondence modeling across frames. In
this paper, we propose Ouroboros-Diffusion, a novel video denoising framework
designed to enhance structural and content (subject) consistency, enabling the
generation of consistent videos of arbitrary length. Specifically, we introduce
a new latent sampling technique at the queue tail to improve structural
consistency, ensuring perceptually smooth transitions among frames. To enhance
subject consistency, we devise a Subject-Aware Cross-Frame Attention (SACFA)
mechanism, which aligns subjects across frames within short segments to achieve
better visual coherence. Furthermore, we introduce self-recurrent guidance.
This technique leverages information from all previous cleaner frames at the
front of the queue to guide the denoising of noisier frames at the end,
fostering rich and contextual global information interaction. Extensive
experiments of long video generation on the VBench benchmark demonstrate the
superiority of our Ouroboros-Diffusion, particularly in terms of subject
consistency, motion smoothness, and temporal consistency.
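
The FIFO queue mechanics described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `toy_denoise_step` is a hypothetical stand-in for one denoising step of a pre-trained text-to-video model, the queue is seeded with pure Gaussian noise for simplicity, and the names `fifo_generate`, `queue_len`, and `shape` are assumptions made for illustration. The sketch shows only the baseline queue behavior (denoise every frame by one step, dequeue the clean head, enqueue fresh noise at the tail), not the latent sampling, SACFA, or self-recurrent guidance that Ouroboros-Diffusion adds.

```python
from collections import deque

import numpy as np


def toy_denoise_step(latent, level):
    """Hypothetical placeholder for one denoising step of a video model.

    A real model would predict and remove noise conditioned on `level`;
    here the latent is simply halved so the sketch stays runnable.
    """
    return latent * 0.5


def fifo_generate(num_frames, queue_len=4, shape=(2, 2), seed=0):
    """Toy FIFO loop: noise level rises from the head (1) to the tail (queue_len)."""
    rng = np.random.default_rng(seed)
    # Assumption: seed the queue with pure noise at increasing levels.
    queue = deque(
        (rng.standard_normal(shape), level) for level in range(1, queue_len + 1)
    )
    clean_frames = []
    while len(clean_frames) < num_frames:
        # Denoise every queued frame by one step, lowering its level by one.
        queue = deque(
            (toy_denoise_step(latent, level), level - 1) for latent, level in queue
        )
        # The head has reached level 0, so it is dequeued as a clean frame.
        latent, level = queue.popleft()
        assert level == 0
        clean_frames.append(latent)
        # Fresh Gaussian noise is enqueued at the tail at the highest level.
        queue.append((rng.standard_normal(shape), queue_len))
    return clean_frames


frames = fifo_generate(num_frames=6)
```

Because one frame leaves and one enters on every iteration, the queue length stays fixed while the output can grow to any length, which is what makes the approach tuning-free for long videos. Each frame in this baseline is denoised without reference to frames that have already left the queue, which is the missing cross-frame correspondence the abstract identifies.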