Mind the Time: Temporally-Controlled Multi-Event Video Generation
December 6, 2024
Authors: Ziyi Wu, Aliaksandr Siarohin, Willi Menapace, Ivan Skorokhodov, Yuwei Fang, Varnith Chordia, Igor Gilitschenski, Sergey Tulyakov
cs.AI
Abstract
Real-world videos consist of sequences of events. Generating such sequences
with precise temporal control is infeasible with existing video generators that
rely on a single paragraph of text as input. When tasked with generating
multiple events described using a single prompt, such methods often ignore some
of the events or fail to arrange them in the correct order. To address this
limitation, we present MinT, a multi-event video generator with temporal
control. Our key insight is to bind each event to a specific period in the
generated video, which allows the model to focus on one event at a time. To
enable time-aware interactions between event captions and video tokens, we
design a time-based positional encoding method, dubbed ReRoPE. This encoding
helps to guide the cross-attention operation. By fine-tuning a pre-trained
video diffusion transformer on temporally grounded data, our approach produces
coherent videos with smoothly connected events. For the first time in the
literature, our model offers control over the timing of events in generated
videos. Extensive experiments demonstrate that MinT outperforms existing
open-source models by a large margin.
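The abstract describes binding each event caption to a specific time span and using a time-based rotary positional encoding (ReRoPE) to guide cross-attention between event captions and video tokens. As a rough illustration only, and not the paper's actual ReRoPE formulation, the sketch below shows one way a timestamp-conditioned rotary encoding could bias cross-attention between video frame tokens and per-event caption tokens. All function and tensor names (`rope_angles`, `apply_rope`, `time_aware_cross_attention`, `frame_times`, `event_spans`) are hypothetical.

```python
# Minimal sketch (illustrative, not the paper's exact method): frame tokens get
# rotary positions derived from their timestamps, rescaled against each event's
# [start, end] span, so attention concentrates on the event active at that moment.
import torch
import torch.nn.functional as F


def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE angles for scalar positions; returns (..., dim // 2)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return positions.unsqueeze(-1) * inv_freq


def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate feature pairs of x (..., dim) by the given angles (..., dim // 2)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out


def time_aware_cross_attention(
    video_q: torch.Tensor,      # (num_frames, dim) one query per video frame token
    caption_k: torch.Tensor,    # (num_events, dim) one pooled key per event caption
    caption_v: torch.Tensor,    # (num_events, dim) matching values
    frame_times: torch.Tensor,  # (num_frames,) timestamp of each frame, in seconds
    event_spans: torch.Tensor,  # (num_events, 2) [start, end] of each event, in seconds
) -> torch.Tensor:
    dim = video_q.shape[-1]

    # Rescale frame timestamps into each event's span: the event start maps to
    # position 0 and the event end to position 1, so frames inside the span end
    # up close (in rotary phase) to that event's caption key.
    starts, ends = event_spans[:, 0], event_spans[:, 1]
    rel_pos = (frame_times[:, None] - starts[None, :]) / (ends - starts).clamp(min=1e-6)

    # Queries carry a per-(frame, event) rotary phase; each caption key sits at
    # the canonical mid-span position 0.5.
    q = apply_rope(video_q[:, None, :].expand(-1, caption_k.shape[0], -1),
                   rope_angles(rel_pos, dim))                              # (F, E, dim)
    k = apply_rope(caption_k,
                   rope_angles(torch.full((caption_k.shape[0],), 0.5), dim))  # (E, dim)

    attn = torch.einsum("fed,ed->fe", q, k) / dim ** 0.5                   # (F, E) logits
    weights = F.softmax(attn, dim=-1)
    return weights @ caption_v                                             # (F, dim)
```

In this toy version, the temporal grounding enters purely through the rotary phases: frames whose timestamps fall inside an event's span are rotated into alignment with that event's caption, while frames outside the span are pushed out of phase, which is the qualitative behavior the abstract attributes to time-aware cross-attention.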