DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation

November 25, 2024
Authors: Zun Wang, Jialu Li, Han Lin, Jaehong Yoon, Mohit Bansal
cs.AI

Abstract

Storytelling video generation (SVG) has recently emerged as the task of creating long, multi-motion, multi-scene videos that consistently represent the story described in an input text script. SVG holds great potential for diverse content creation in media and entertainment; however, it also presents significant challenges: (1) objects must exhibit a range of fine-grained, complex motions, (2) multiple objects need to appear consistently across scenes, and (3) subjects may require multiple motions with seamless transitions within a single scene. To address these challenges, we propose DreamRunner, a novel story-to-video generation method. First, we structure the input script using a large language model (LLM) to facilitate both coarse-grained scene planning and fine-grained object-level layout and motion planning. Next, DreamRunner presents retrieval-augmented test-time adaptation to capture target motion priors for objects in each scene, supporting diverse motion customization based on retrieved videos and thus facilitating the generation of new videos with complex, scripted motions. Lastly, we propose a novel spatial-temporal region-based 3D attention and prior injection module, SR3AI, for fine-grained object-motion binding and frame-by-frame semantic control. We compare DreamRunner with various SVG baselines, demonstrating state-of-the-art performance in character consistency, text alignment, and smooth transitions. Additionally, DreamRunner exhibits strong fine-grained condition-following ability in compositional text-to-video generation, significantly outperforming baselines on T2V-ComBench. Finally, we validate DreamRunner's robust ability to generate multi-object interactions with qualitative examples.
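As a rough illustration of what a dual-level plan with region-level control could look like, the sketch below stores a scene as a list of per-object regions (a box held over a span of frames) and turns each region into a spatio-temporal boolean mask. This is not the authors' implementation: the names `RegionPlan`, `ScenePlan`, `region_mask`, and `count_cells`, the grid sizes, and the box convention are all hypothetical, and the abstract does not specify how SR3AI represents regions.

```python
from dataclasses import dataclass, field


@dataclass
class RegionPlan:
    """Hypothetical fine-grained entry: one object, one motion, one region."""
    name: str      # object or character name taken from the script
    motion: str    # motion description for this span of frames
    box: tuple     # (x0, y0, x1, y1) in grid cells, end-exclusive
    frames: tuple  # (start, end) frame indices, end-exclusive


@dataclass
class ScenePlan:
    """Hypothetical coarse-grained entry: one scene and its region plans."""
    description: str
    num_frames: int
    height: int
    width: int
    regions: list = field(default_factory=list)


def region_mask(scene, region):
    """Build a frames x height x width nested list of booleans.

    A cell is True when it lies inside the region's box during the
    region's frame span, and False everywhere else.
    """
    x0, y0, x1, y1 = region.box
    f0, f1 = region.frames
    return [
        [
            [f0 <= f < f1 and y0 <= y < y1 and x0 <= x < x1
             for x in range(scene.width)]
            for y in range(scene.height)
        ]
        for f in range(scene.num_frames)
    ]


def count_cells(mask):
    """Count the True cells in a mask produced by region_mask."""
    return sum(cell for frame in mask for row in frame for cell in row)


# Assumed toy scene: one subject performs two motions in sequence,
# each bound to its own region and frame span.
scene = ScenePlan("A dog plays in a yard", num_frames=4, height=4, width=6)
scene.regions.append(RegionPlan("dog", "runs", (0, 0, 3, 4), (0, 2)))
scene.regions.append(RegionPlan("dog", "jumps", (3, 0, 6, 4), (2, 4)))
masks = [region_mask(scene, r) for r in scene.regions]
```

In a region-based attention scheme, masks of this kind could restrict which visual tokens attend to which motion description, so that "runs" and "jumps" each affect only their own frames and area. How DreamRunner actually applies such constraints is described in the paper, not here.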
