Long Context Tuning for Video Generation
March 13, 2025
Authors: Yuwei Guo, Ceyuan Yang, Ziyan Yang, Zhibei Ma, Zhijie Lin, Zhenheng Yang, Dahua Lin, Lu Jiang
cs.AI
Abstract
Recent advances in video generation can produce realistic, minute-long
single-shot videos with scalable diffusion transformers. However, real-world
narrative videos require multi-shot scenes with visual and dynamic consistency
across shots. In this work, we introduce Long Context Tuning (LCT), a training
paradigm that expands the context window of pre-trained single-shot video
diffusion models to learn scene-level consistency directly from data. Our
method expands full attention mechanisms from individual shots to encompass all
shots within a scene, incorporating interleaved 3D position embedding and an
asynchronous noise strategy, enabling both joint and auto-regressive shot
generation without additional parameters. Models with bidirectional attention
after LCT can further be fine-tuned with context-causal attention, facilitating
auto-regressive generation with efficient KV-cache. Experiments demonstrate that
single-shot models after LCT can produce coherent multi-shot scenes and exhibit
emerging capabilities, including compositional generation and interactive shot
extension, paving the way for more practical visual content creation. See
https://guoyww.github.io/projects/long-context-video/ for more details.
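To make the two mechanisms named in the abstract concrete, the sketch below illustrates an asynchronous per-shot noise strategy (each shot in a scene is diffused with its own independently sampled timestep, while already-generated context shots can be kept clean at t = 0) and a shot-level context-causal attention mask (tokens of shot i attend to all tokens of shots 0..i). This is a minimal illustration, not the authors' implementation: the function names, tensor shapes, and the simplified linear alpha-bar schedule are assumptions for demonstration only.

```python
# Hypothetical sketch of ideas from the abstract; shapes, names, and the
# noise schedule are illustrative assumptions, not the paper's actual code.
import torch

def asynchronous_noise(shot_latents, num_train_timesteps=1000, clean_shot_ids=()):
    """Diffuse each shot with its own independently sampled timestep.

    shot_latents: list of per-shot latent tensors, each (tokens, channels).
    clean_shot_ids: shots kept at t=0 (e.g. previously generated context shots),
                    which is what enables auto-regressive shot generation.
    """
    noisy_shots, timesteps = [], []
    for i, x0 in enumerate(shot_latents):
        t = 0 if i in clean_shot_ids else int(torch.randint(1, num_train_timesteps, (1,)))
        # Simple DDPM-style forward process with a linear alpha_bar stand-in.
        alpha_bar = 1.0 - t / num_train_timesteps
        noise = torch.randn_like(x0)
        noisy_shots.append(alpha_bar ** 0.5 * x0 + (1.0 - alpha_bar) ** 0.5 * noise)
        timesteps.append(t)
    return noisy_shots, timesteps

def context_causal_mask(tokens_per_shot):
    """Block-causal mask over shots: full attention within a shot,
    and shot i additionally attends to all tokens of shots 0..i-1."""
    shot_ids = torch.cat([torch.full((n,), i) for i, n in enumerate(tokens_per_shot)])
    # True where the query token's shot index >= the key token's shot index.
    return shot_ids[:, None] >= shot_ids[None, :]

# Usage: a scene of three shots with 4 latent tokens each; shot 0 is clean context.
shots = [torch.randn(4, 16) for _ in range(3)]
noisy, ts = asynchronous_noise(shots, clean_shot_ids={0})
mask = context_causal_mask([4, 4, 4])
print(ts, mask.shape)  # e.g. [0, 512, 87] and torch.Size([12, 12])
```

With bidirectional attention (as in LCT before the causal fine-tuning stage), the mask above would simply be all-True across the scene; switching to the block-causal variant is what permits reusing cached keys and values of earlier shots during auto-regressive extension.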