Tuning-Free Multi-Event Long Video Generation via Synchronized Coupled Sampling
March 11, 2025
Authors: Subin Kim, Seoung Wug Oh, Jui-Hsien Wang, Joon-Young Lee, Jinwoo Shin
cs.AI
Abstract
While recent advancements in text-to-video diffusion models enable
high-quality short video generation from a single prompt, generating real-world
long videos in a single pass remains challenging due to limited data and high
computational costs. To address this, several works propose tuning-free
approaches, i.e., extending existing models for long video generation,
specifically using multiple prompts to allow for dynamic and controlled content
changes. However, these methods primarily focus on ensuring smooth transitions
between adjacent frames, often leading to content drift and a gradual loss of
semantic coherence over longer sequences. To address this issue, we propose
Synchronized Coupled Sampling (SynCoS), a novel inference framework that
synchronizes denoising paths across the entire video, ensuring long-range
consistency across both adjacent and distant frames. Our approach combines two
complementary sampling strategies: reverse and optimization-based sampling,
which ensure seamless local transitions and enforce global coherence,
respectively. However, directly alternating between the two strategies misaligns
their denoising trajectories, disrupting prompt guidance and introducing unintended
content changes, because the strategies operate independently. To resolve this, SynCoS
synchronizes them through a grounded timestep and a fixed baseline noise,
ensuring fully coupled sampling with aligned denoising paths. Extensive
experiments show that SynCoS significantly improves multi-event long video
generation, achieving smoother transitions and superior long-range coherence,
outperforming previous approaches both quantitatively and qualitatively.
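
As a rough illustration of why a shared timestep and a fixed baseline noise help keep denoising paths aligned, the toy sketch below noises overlapping temporal chunks of a latent video in two ways: once from a single baseline noise tensor, and once with fresh noise per chunk. It is not the SynCoS algorithm. The chunking scheme, the helper names (split_into_chunks, add_noise, overlap_gap), the tensor sizes, and the value of alpha_bar are all assumptions made for this example; only the standard forward-diffusion noising formula is taken as given.

```python
import numpy as np

def split_into_chunks(num_frames, chunk_len, stride):
    """Return (start, end) index pairs for overlapping temporal chunks (hypothetical scheme)."""
    starts = range(0, num_frames - chunk_len + 1, stride)
    return [(s, s + chunk_len) for s in starts]

def add_noise(x0, noise, alpha_bar):
    """Standard forward-diffusion noising: x_t = sqrt(a) * x0 + sqrt(1 - a) * noise."""
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

def overlap_gap(noised, chunks):
    """Largest absolute disagreement on frames shared by consecutive chunks."""
    gap = 0.0
    for i in range(len(chunks) - 1):
        end_cur = chunks[i][1]
        start_next = chunks[i + 1][0]
        n = end_cur - start_next  # number of frames the two chunks share
        diff = np.abs(noised[i][-n:] - noised[i + 1][:n])
        gap = max(gap, float(diff.max()))
    return gap

rng = np.random.default_rng(0)
num_frames, dim = 12, 4                         # toy sizes, chosen arbitrarily
video_x0 = rng.normal(size=(num_frames, dim))   # toy clean latent, one row per frame
chunks = split_into_chunks(num_frames, chunk_len=6, stride=3)
alpha_bar = 0.5                                 # assumed noise level shared by all chunks

# Synchronized: every chunk slices the same baseline noise at the same timestep,
# so frames that appear in two chunks receive identical noisy latents.
baseline_noise = rng.normal(size=(num_frames, dim))
synced = [add_noise(video_x0[s:e], baseline_noise[s:e], alpha_bar) for s, e in chunks]

# Unsynchronized: each chunk draws its own noise, so shared frames disagree.
unsynced = [add_noise(video_x0[s:e], rng.normal(size=(e - s, dim)), alpha_bar)
            for s, e in chunks]

print("synchronized gap:  ", overlap_gap(synced, chunks))
print("unsynchronized gap:", overlap_gap(unsynced, chunks))
```

In this toy setting the synchronized chunks agree exactly on their shared frames, while the independently noised chunks do not. This mirrors, in a very simplified form, the abstract's point that sampling stages operating independently drift apart unless they are tied to a common reference.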