DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation

Abstract

Sora-like video generation models have achieved remarkable progress with the Multi-Modal Diffusion Transformer (MM-DiT) architecture. However, current video generation models predominantly focus on a single prompt and struggle to generate coherent scenes from multiple sequential prompts, which better reflect real-world dynamic scenarios. While some pioneering works have explored multi-prompt video generation, they face significant challenges, including strict training data requirements, weak prompt following, and unnatural transitions. To address these problems, we propose DiTCtrl, the first training-free multi-prompt video generation method for MM-DiT architectures. Our key idea is to treat multi-prompt video generation as temporal video editing with smooth transitions. To achieve this, we first analyze MM-DiT's attention mechanism and find that its 3D full attention behaves similarly to the cross-attention and self-attention blocks in UNet-like diffusion models. This enables mask-guided, precise semantic control across different prompts, with attention sharing for multi-prompt video generation. With this design, videos generated by DiTCtrl achieve smooth transitions and consistent object motion given multiple sequential prompts, without additional training. We also present MPVBench, a new benchmark designed specifically to evaluate multi-prompt video generation. Extensive experiments demonstrate that our method achieves state-of-the-art performance without additional training.
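The observation about 3D full attention can be illustrated with a toy example. The sketch below is not the DiTCtrl implementation; it only assumes that text and video tokens are concatenated as [text; video] before a single attention pass, and shows how the resulting joint attention map splits into blocks that play the roles of cross-attention (video-to-text) and self-attention (video-to-video). The function names, the toy token values, the use of raw tokens as both queries and keys, and the mean-threshold mask are all hypothetical choices made for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def full_attention_map(queries, keys):
    """Row-wise softmax(Q K^T / sqrt(d)) over every token pair."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        for q in queries
    ]

def split_blocks(attn, n_text):
    """Split a joint map, with tokens ordered [text; video], into four blocks."""
    return {
        "text_to_text": [row[:n_text] for row in attn[:n_text]],
        "text_to_video": [row[n_text:] for row in attn[:n_text]],
        "video_to_text": [row[:n_text] for row in attn[n_text:]],    # cross-attention-like
        "video_to_video": [row[n_text:] for row in attn[n_text:]],   # self-attention-like
    }

def token_mask(video_to_text, text_index):
    """Hypothetical mask: video tokens attending to one text token above the mean."""
    column = [row[text_index] for row in video_to_text]
    mean = sum(column) / len(column)
    return [value >= mean for value in column]

# Toy tokens (assumption: a real model applies learned Q/K projections first).
text_tokens = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
video_tokens = [[0.5, 0.5, 0.0], [1.0, 1.0, 0.0], [0.0, 0.5, 1.0]]
tokens = text_tokens + video_tokens

attn = full_attention_map(tokens, tokens)
blocks = split_blocks(attn, n_text=len(text_tokens))
mask = token_mask(blocks["video_to_text"], text_index=0)
```

Because every row of the joint map is normalized over text and video tokens together, the video-to-text and video-to-video blocks of a given row share one probability budget, which is one way to see why a single full-attention layer can stand in for separate cross-attention and self-attention blocks.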
