Mind the Time: Temporally-Controlled Multi-Event Video Generation

December 6, 2024
Authors: Ziyi Wu, Aliaksandr Siarohin, Willi Menapace, Ivan Skorokhodov, Yuwei Fang, Varnith Chordia, Igor Gilitschenski, Sergey Tulyakov
cs.AI

Abstract

Real-world videos consist of sequences of events. Generating such sequences with precise temporal control is infeasible with existing video generators that rely on a single paragraph of text as input. When tasked with generating multiple events described using a single prompt, such methods often ignore some of the events or fail to arrange them in the correct order. To address this limitation, we present MinT, a multi-event video generator with temporal control. Our key insight is to bind each event to a specific period in the generated video, which allows the model to focus on one event at a time. To enable time-aware interactions between event captions and video tokens, we design a time-based positional encoding method, dubbed ReRoPE. This encoding helps to guide the cross-attention operation. By fine-tuning a pre-trained video diffusion transformer on temporally grounded data, our approach produces coherent videos with smoothly connected events. For the first time in the literature, our model offers control over the timing of events in generated videos. Extensive experiments demonstrate that MinT outperforms existing open-source models by a large margin.
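The abstract does not give implementation details, but the sketch below illustrates one plausible way a time-based rotary positional encoding could guide cross-attention between an event caption and video tokens: video-frame timestamps are remapped relative to the event's time span so that frames inside the span align with the caption's positions. All names, shapes, and the specific rescaling rule (rescaled_positions, rotary, the clamp-and-scale mapping) are assumptions for illustration, not the authors' ReRoPE implementation.

```python
# Illustrative sketch only: time-conditioned RoPE for caption-to-video cross-attention.
# The rescaling rule and all names are assumptions, not the MinT/ReRoPE implementation.
import torch

def rescaled_positions(frame_times, event_start, event_end, num_caption_tokens):
    """Map each frame timestamp into the caption's position range.

    Frames inside [event_start, event_end] are linearly rescaled to
    [0, num_caption_tokens]; frames outside are clamped to the ends,
    so the caption attends most strongly within its own time span (assumed rule).
    """
    t = (frame_times - event_start) / (event_end - event_start + 1e-6)
    return t.clamp(0.0, 1.0) * num_caption_tokens

def rotary(x, pos, base=10000.0):
    """Apply standard RoPE rotation to x (..., seq, dim) at positions pos (..., seq)."""
    dim = x.shape[-1]
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype) / half)   # (half,)
    angles = pos[..., None] * freqs                               # (..., seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Toy usage: one event caption of 8 tokens attends over 16 video frames.
q = torch.randn(1, 8, 64)                     # caption-token queries
k = torch.randn(1, 16, 64)                    # video-token keys
frame_times = torch.linspace(0.0, 4.0, 16)    # frame timestamps in seconds
q_pos = torch.arange(8, dtype=torch.float)    # caption tokens keep ordinal positions
k_pos = rescaled_positions(frame_times, event_start=1.0, event_end=3.0,
                           num_caption_tokens=8).expand(1, -1)

attn = torch.softmax(
    rotary(q, q_pos) @ rotary(k, k_pos).transpose(-1, -2) / 64 ** 0.5, dim=-1)
print(attn.shape)  # (1, 8, 16): caption tokens favor frames inside the event span
```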
