DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation
December 24, 2024
Authors: Minghong Cai, Xiaodong Cun, Xiaoyu Li, Wenze Liu, Zhaoyang Zhang, Yong Zhang, Ying Shan, Xiangyu Yue
cs.AI
Abstract
Sora-like video generation models have achieved remarkable progress with the
Multi-Modal Diffusion Transformer (MM-DiT) architecture. However, current
video generation models predominantly focus on single-prompt generation and struggle to
generate coherent scenes with multiple sequential prompts that better reflect
real-world dynamic scenarios. While some pioneering works have explored
multi-prompt video generation, they face significant challenges including
strict training data requirements, weak prompt following, and unnatural
transitions. To address these problems, we propose DiTCtrl, the first
training-free multi-prompt video generation method for MM-DiT architectures.
Our key idea is to treat the multi-prompt video generation task as
temporal video editing with smooth transitions. To achieve this goal, we first
analyze MM-DiT's attention mechanism, finding that the 3D full attention
behaves similarly to the cross/self-attention blocks in UNet-like
diffusion models, enabling mask-guided precise semantic control across
different prompts with attention sharing for multi-prompt video generation.
Based on our careful design, the video generated by DiTCtrl achieves smooth
transitions and consistent object motion given multiple sequential prompts
without additional training. We also present MPVBench, a new benchmark
designed specifically to evaluate multi-prompt video generation. Extensive
experiments demonstrate that our method achieves state-of-the-art performance
without additional training.