
Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization

April 11, 2025
作者: Jialu Li, Shoubin Yu, Han Lin, Jaemin Cho, Jaehong Yoon, Mohit Bansal
cs.AI

Abstract

Recent advancements in text-to-video (T2V) diffusion models have significantly enhanced the visual quality of the generated videos. However, even recent T2V models struggle to follow text descriptions accurately, especially when the prompt requires precise control of spatial layouts or object trajectories. A recent line of research applies layout guidance to T2V models, requiring fine-tuning or iterative manipulation of the attention map at inference time. This significantly increases the memory requirement, making it difficult to adopt a large T2V model as a backbone. To address this, we introduce Video-MSG, a training-free Guidance method for T2V generation based on Multimodal planning and Structured noise initialization. Video-MSG consists of three steps. In the first two steps, Video-MSG creates Video Sketch, a fine-grained spatio-temporal plan for the final video that specifies background, foreground, and object trajectories in the form of draft video frames. In the last step, Video-MSG guides a downstream T2V diffusion model with Video Sketch through noise inversion and denoising. Notably, Video-MSG needs neither fine-tuning nor attention manipulation with additional memory at inference time, making it easier to adopt large T2V models. Video-MSG demonstrates its effectiveness in enhancing text alignment with multiple T2V backbones (VideoCrafter2 and CogVideoX-5B) on popular T2V generation benchmarks (T2VCompBench and VBench). We provide comprehensive ablation studies on the noise inversion ratio, different background generators, background object detection, and foreground object segmentation.
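The last step of the abstract, guiding a T2V model with Video Sketch "through noise inversion and denoising", amounts to forward-diffusing the draft frames to an intermediate timestep (controlled by a noise inversion ratio) and letting the backbone denoise from there instead of from pure Gaussian noise. A minimal numpy sketch of that initialization follows; the function name `structured_noise_init`, the linear beta schedule, and all shapes are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def structured_noise_init(sketch_latents, inversion_ratio, alphas_cumprod, rng=None):
    """Forward-diffuse Video Sketch latents to an intermediate timestep.

    sketch_latents  : (F, C, H, W) latents of the draft video frames
    inversion_ratio : fraction of the schedule to re-noise
                      (0 -> keep the sketch, 1 -> pure Gaussian noise)
    alphas_cumprod  : (T,) cumulative product of (1 - beta_t)
    Returns the noised latents x_t and the chosen timestep t.
    """
    rng = rng or np.random.default_rng(0)
    T = len(alphas_cumprod)
    t = min(int(inversion_ratio * T), T - 1)
    a_bar = alphas_cumprod[t]
    noise = rng.standard_normal(sketch_latents.shape)
    # Closed-form forward process q(x_t | x_0):
    #   x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps
    return np.sqrt(a_bar) * sketch_latents + np.sqrt(1.0 - a_bar) * noise, t

# Toy usage with a hypothetical linear beta schedule (1000 steps).
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)
sketch = np.zeros((8, 4, 16, 16))  # 8 draft frames as dummy latents
x_t, t = structured_noise_init(sketch, inversion_ratio=0.6, alphas_cumprod=alphas_cumprod)
```

Starting the backbone's denoising loop from `x_t` at timestep `t` (rather than from pure noise at `T`) is what lets the sketch's layout and trajectories survive into the final video; a lower inversion ratio preserves the sketch more strongly.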

