Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion

Abstract

First-in-first-out (FIFO) video diffusion, built on a pre-trained text-to-video model, has recently emerged as an effective approach for tuning-free long video generation. This technique maintains a queue of video frames with progressively increasing noise, continuously producing clean frames at the queue's head while Gaussian noise is enqueued at the tail. However, FIFO-Diffusion often struggles to maintain long-range temporal consistency in the generated videos because it does not model correspondence across frames. In this paper, we propose Ouroboros-Diffusion, a novel video denoising framework designed to enhance structural and content (subject) consistency, enabling the generation of consistent videos of arbitrary length. Specifically, to improve structural consistency, we introduce a new latent sampling technique at the queue tail that ensures perceptually smooth transitions among frames. To enhance subject consistency, we devise a Subject-Aware Cross-Frame Attention (SACFA) mechanism, which aligns subjects across frames within short segments to achieve better visual coherence. Furthermore, we introduce self-recurrent guidance. This technique leverages information from all previous cleaner frames at the front of the queue to guide the denoising of the noisier frames at the end, fostering rich, contextual global information interaction. Extensive experiments on long video generation with the VBench benchmark demonstrate the superiority of Ouroboros-Diffusion, particularly in terms of subject consistency, motion smoothness, and temporal consistency.
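The queue mechanics described in the abstract can be illustrated with a minimal toy sketch. It shows only the FIFO bookkeeping: noise levels increase from head to tail, each pass lowers every level by one, the clean head frame is dequeued, and fresh Gaussian noise is enqueued at the tail. Everything here is an assumption made for illustration: `toy_denoise_step` is a hypothetical placeholder for a pre-trained text-to-video denoiser, latents are short lists of floats, and the queue is initialized from pure noise. It does not implement the latent sampling, SACFA, or self-recurrent guidance proposed in the paper.

```python
import random
from collections import deque


def toy_denoise_step(latent, level):
    """Hypothetical stand-in for one denoising step of a pre-trained model.

    It only shrinks the values; it does not model real diffusion.
    """
    return [x * 0.5 for x in latent]


def fifo_generate(num_frames, queue_len=4, dim=3, seed=0):
    """Toy FIFO loop: returns the clean frames and the final noise levels."""
    rng = random.Random(seed)

    def new_noise():
        # Fresh Gaussian noise for a frame entering the queue.
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]

    # Queue entries are (latent, noise_level); the level grows from head to tail.
    queue = deque((new_noise(), level) for level in range(1, queue_len + 1))
    clean_frames = []

    while len(clean_frames) < num_frames:
        # One pass over the queue lowers every frame's noise level by one.
        queue = deque((toy_denoise_step(z, lvl), lvl - 1) for z, lvl in queue)

        # The head frame has reached level 0, so it is dequeued as a clean frame.
        latent, level = queue.popleft()
        assert level == 0
        clean_frames.append(latent)

        # Fresh Gaussian noise is enqueued at the tail at the highest level.
        queue.append((new_noise(), queue_len))

    return clean_frames, [lvl for _, lvl in queue]


if __name__ == "__main__":
    frames, levels = fifo_generate(num_frames=6)
    print(len(frames), levels)
```

Because the tail is refilled with independent noise on every iteration, nothing in this loop ties new frames to earlier content, which is the gap the abstract attributes to FIFO-Diffusion.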
