DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation
November 25, 2024
Authors: Zun Wang, Jialu Li, Han Lin, Jaehong Yoon, Mohit Bansal
cs.AI
Abstract
Storytelling video generation (SVG) has recently emerged as a task to create
long, multi-motion, multi-scene videos that consistently represent the story
described in the input text script. SVG holds great potential for diverse
content creation in media and entertainment; however, it also presents
significant challenges: (1) objects must exhibit a range of fine-grained,
complex motions, (2) multiple objects need to appear consistently across
scenes, and (3) subjects may require multiple motions with seamless transitions
within a single scene. To address these challenges, we propose DreamRunner, a
novel story-to-video generation method: First, we structure the input script
using a large language model (LLM) to facilitate both coarse-grained scene
planning and fine-grained object-level layout and motion planning. Next,
DreamRunner presents retrieval-augmented test-time adaptation to capture target
motion priors for objects in each scene, supporting diverse motion
customization based on retrieved videos, thus facilitating the generation of
new videos with complex, scripted motions. Lastly, we propose a novel
spatial-temporal region-based 3D attention and prior injection module SR3AI for
fine-grained object-motion binding and frame-by-frame semantic control. We
compare DreamRunner with various SVG baselines, demonstrating state-of-the-art
performance in character consistency, text alignment, and smooth transitions.
Additionally, DreamRunner exhibits strong fine-grained condition-following
ability in compositional text-to-video generation, significantly outperforming
baselines on T2V-CompBench. Finally, we validate DreamRunner's robust ability
to generate multi-object interactions with qualitative examples.
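The three-stage pipeline in the abstract (LLM-based hierarchical planning, retrieval-augmented motion adaptation, and region-based generation with SR3AI) can be sketched at a very high level. The functions, data structures, and matching heuristic below are illustrative assumptions for exposition only, not DreamRunner's actual components or API.

```python
# Illustrative sketch of the three-stage pipeline described in the abstract.
# All names, data structures, and logic below are assumptions for exposition,
# not the paper's actual implementation.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ScenePlan:
    description: str                                        # coarse-grained scene summary
    motion_plans: List[str] = field(default_factory=list)   # fine-grained per-object motion plans

def plan_story(script: str) -> List[ScenePlan]:
    """Stage 1 (stand-in for the LLM planner): split the script into scenes
    and attach a placeholder motion plan to each."""
    scenes = [s.strip() for s in script.split(".") if s.strip()]
    return [ScenePlan(description=s, motion_plans=[f"motion for: {s}"]) for s in scenes]

def retrieve_motion_videos(plan: ScenePlan, library: List[str]) -> List[str]:
    """Stage 2 (stand-in for retrieval-augmented test-time adaptation): pick
    reference videos whose underscore-separated tags appear (as substrings)
    in the scene description."""
    return [v for v in library if any(tag in plan.description for tag in v.split("_"))]

def generate_scene(plan: ScenePlan, priors: List[str]) -> str:
    """Stage 3 (stand-in for generation with the SR3AI module): return a
    placeholder clip identifier recording how many motion priors were used."""
    return f"clip({plan.description!r}, priors={len(priors)})"

script = "A fox runs through the forest. The fox leaps over a stream"
library = ["fox_runs", "dog_leaps", "bird_flies"]
clips = [generate_scene(p, retrieve_motion_videos(p, library)) for p in plan_story(script)]
```

In the actual method, each placeholder would be replaced by a learned component: the LLM produces structured scene/layout/motion plans, retrieved videos drive test-time adaptation of motion priors, and the region-based 3D attention module binds objects to motions frame by frame.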