Mind the Time: Temporally-Controlled Multi-Event Video Generation
December 6, 2024
Authors: Ziyi Wu, Aliaksandr Siarohin, Willi Menapace, Ivan Skorokhodov, Yuwei Fang, Varnith Chordia, Igor Gilitschenski, Sergey Tulyakov
cs.AI
Abstract
Real-world videos consist of sequences of events. Generating such sequences
with precise temporal control is infeasible with existing video generators that
rely on a single paragraph of text as input. When tasked with generating
multiple events described using a single prompt, such methods often ignore some
of the events or fail to arrange them in the correct order. To address this
limitation, we present MinT, a multi-event video generator with temporal
control. Our key insight is to bind each event to a specific period in the
generated video, which allows the model to focus on one event at a time. To
enable time-aware interactions between event captions and video tokens, we
design a time-based positional encoding method, dubbed ReRoPE. This encoding
helps to guide the cross-attention operation. By fine-tuning a pre-trained
video diffusion transformer on temporally grounded data, our approach produces
coherent videos with smoothly connected events. For the first time in the
literature, our model offers control over the timing of events in generated
videos. Extensive experiments demonstrate that MinT outperforms existing
open-source models by a large margin.
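The abstract describes binding each event caption to a specific time span and using a time-based rotary positional encoding (ReRoPE) to guide cross-attention between event captions and video tokens. As a rough illustration only, and not the paper's actual ReRoPE formulation, the sketch below shows one way a timestamp-conditioned rotary encoding could bias cross-attention between video frame tokens and per-event caption tokens. All function and tensor names (`rope_angles`, `apply_rope`, `time_aware_cross_attention`, `frame_times`, `event_spans`) are hypothetical.

```python
# Minimal sketch (illustrative, not the paper's exact method): frame tokens get
# rotary positions derived from their timestamps, rescaled against each event's
# [start, end] span, so attention concentrates on the event active at that moment.
import torch
import torch.nn.functional as F


def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE angles for scalar positions; returns (..., dim // 2)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return positions.unsqueeze(-1) * inv_freq


def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate feature pairs of x (..., dim) by the given angles (..., dim // 2)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out


def time_aware_cross_attention(
    video_q: torch.Tensor,      # (num_frames, dim) one query per video frame token
    caption_k: torch.Tensor,    # (num_events, dim) one pooled key per event caption
    caption_v: torch.Tensor,    # (num_events, dim) matching values
    frame_times: torch.Tensor,  # (num_frames,) timestamp of each frame, in seconds
    event_spans: torch.Tensor,  # (num_events, 2) [start, end] of each event, in seconds
) -> torch.Tensor:
    dim = video_q.shape[-1]

    # Rescale frame timestamps into each event's span: the event start maps to
    # position 0 and the event end to position 1, so frames inside the span end
    # up close (in rotary phase) to that event's caption key.
    starts, ends = event_spans[:, 0], event_spans[:, 1]
    rel_pos = (frame_times[:, None] - starts[None, :]) / (ends - starts).clamp(min=1e-6)

    # Queries carry a per-(frame, event) rotary phase; each caption key sits at
    # the canonical mid-span position 0.5.
    q = apply_rope(video_q[:, None, :].expand(-1, caption_k.shape[0], -1),
                   rope_angles(rel_pos, dim))                              # (F, E, dim)
    k = apply_rope(caption_k,
                   rope_angles(torch.full((caption_k.shape[0],), 0.5), dim))  # (E, dim)

    attn = torch.einsum("fed,ed->fe", q, k) / dim ** 0.5                   # (F, E) logits
    weights = F.softmax(attn, dim=-1)
    return weights @ caption_v                                             # (F, dim)
```

In this toy version, the temporal grounding enters purely through the rotary phases: frames whose timestamps fall inside an event's span are rotated into alignment with that event's caption, while frames outside the span are pushed out of phase, which is the qualitative behavior the abstract attributes to time-aware cross-attention.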