Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization
April 11, 2025
作者: Jialu Li, Shoubin Yu, Han Lin, Jaemin Cho, Jaehong Yoon, Mohit Bansal
cs.AI
Abstract
Recent advancements in text-to-video (T2V) diffusion models have
significantly enhanced the visual quality of the generated videos. However,
even recent T2V models find it challenging to follow text descriptions
accurately, especially when the prompt requires precise control of spatial
layouts or object trajectories. A recent line of research uses layout guidance
for T2V models, which requires fine-tuning or iterative manipulation of the
attention map at inference time. This significantly increases memory
requirements, making it difficult to adopt a large T2V model as a backbone. To
address this, we introduce Video-MSG, a training-free Guidance method for T2V
generation based on Multimodal planning and Structured noise initialization.
Video-MSG consists of three steps. In the first two steps, Video-MSG
creates a Video Sketch: a fine-grained spatio-temporal plan for the final video
that specifies the background, foreground, and object trajectories in the form
of draft video frames. In the last step, Video-MSG guides a downstream T2V
diffusion model with Video Sketch through noise inversion and denoising.
Notably, Video-MSG requires neither fine-tuning nor memory-intensive attention
manipulation at inference time, making it easier to adopt large T2V
models. Video-MSG demonstrates its effectiveness in enhancing text alignment
with multiple T2V backbones (VideoCrafter2 and CogVideoX-5B) on popular T2V
generation benchmarks (T2VCompBench and VBench). We provide comprehensive
ablation studies about noise inversion ratio, different background generators,
background object detection, and foreground object segmentation.
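The planning stage described above specifies object trajectories as part of the Video Sketch. As a minimal illustration of what such a spatio-temporal plan can look like, the sketch below linearly interpolates a foreground object's bounding box across frames. This is a hypothetical simplification for illustration only; the function name `plan_trajectory` and the linear-interpolation scheme are assumptions, not the paper's actual planning procedure (which uses multimodal models to produce draft frames).

```python
import numpy as np

def plan_trajectory(start_box, end_box, num_frames):
    """Linearly interpolate a bounding box across video frames.

    A minimal stand-in for a spatio-temporal object-trajectory plan:
    the foreground object's box in each frame is a straight-line blend
    between its start and end positions. Boxes are [x0, y0, x1, y1].
    """
    start = np.asarray(start_box, dtype=float)
    end = np.asarray(end_box, dtype=float)
    # Blend weights from 0 (first frame) to 1 (last frame), one per frame.
    ts = np.linspace(0.0, 1.0, num_frames)[:, None]
    return (1.0 - ts) * start + ts * end

# An object moving left-to-right over 5 frames.
boxes = plan_trajectory([10, 40, 30, 60], [70, 40, 90, 60], num_frames=5)
```

Each row of `boxes` would then seed one draft frame of the Video Sketch, with the object placed at the interpolated box.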
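The final step guides the T2V model with the Video Sketch through noise inversion and denoising. A common way to realize this kind of structured noise initialization is to apply the diffusion forward process to the draft frames up to an intermediate timestep, then resume denoising from there (in the style of SDEdit-like image/video editing). The sketch below shows that forward-noising step under a toy linear-beta DDPM schedule; the function name `structured_noise_init`, the schedule, and the `inversion_ratio` parameter are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def structured_noise_init(sketch_frames, alphas_cumprod, inversion_ratio=0.6, rng=None):
    """Forward-noise draft frames to an intermediate diffusion timestep.

    Instead of starting denoising from pure Gaussian noise, the draft
    frames are noised to t = inversion_ratio * (T - 1), so the layout
    and trajectory plan survives while fine details are regenerated
    during the remaining denoising steps.
    """
    rng = rng or np.random.default_rng(0)
    T = len(alphas_cumprod)
    t = int(inversion_ratio * (T - 1))
    a_bar = alphas_cumprod[t]
    noise = rng.standard_normal(sketch_frames.shape)
    # Standard DDPM forward process: x_t = sqrt(a_bar)*x_0 + sqrt(1-a_bar)*eps
    noisy = np.sqrt(a_bar) * sketch_frames + np.sqrt(1.0 - a_bar) * noise
    return noisy, t

# Toy linear-beta schedule and dummy draft frames (8 frames, 16x16 RGB).
betas = np.linspace(1e-4, 2e-2, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)
frames = np.zeros((8, 16, 16, 3))
noisy, t = structured_noise_init(frames, alphas_cumprod, inversion_ratio=0.6)
```

A downstream T2V denoiser would then start from `noisy` at timestep `t` rather than from pure noise at `T - 1`; a larger `inversion_ratio` gives the model more freedom but preserves less of the sketch.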