DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation

December 24, 2024
Authors: Minghong Cai, Xiaodong Cun, Xiaoyu Li, Wenze Liu, Zhaoyang Zhang, Yong Zhang, Ying Shan, Xiangyu Yue
cs.AI

Abstract

Sora-like video generation models have achieved remarkable progress with the Multi-Modal Diffusion Transformer (MM-DiT) architecture. However, current video generation models predominantly focus on a single prompt, struggling to generate coherent scenes from multiple sequential prompts that better reflect real-world dynamic scenarios. While some pioneering works have explored multi-prompt video generation, they face significant challenges, including strict training-data requirements, weak prompt following, and unnatural transitions. To address these problems, we propose DiTCtrl, the first training-free multi-prompt video generation method built on MM-DiT architectures. Our key idea is to treat multi-prompt video generation as temporal video editing with smooth transitions. To achieve this, we first analyze MM-DiT's attention mechanism and find that its 3D full attention behaves similarly to the cross/self-attention blocks in UNet-like diffusion models, enabling mask-guided, precise semantic control across different prompts via attention sharing. Based on this design, videos generated by DiTCtrl achieve smooth transitions and consistent object motion given multiple sequential prompts, without additional training. We also present MPVBench, a new benchmark specially designed to evaluate multi-prompt video generation. Extensive experiments demonstrate that our method achieves state-of-the-art performance without additional training.
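
The two mechanisms the abstract names, attention sharing across prompt segments inside the 3D full attention and smooth transitions between segments, can be illustrated with a short sketch. The PyTorch snippet below is a minimal, hypothetical illustration under stated assumptions, not the authors' implementation: the function names (`shared_kv_attention`, `blend_overlap`), the overlap-window cross-fade, and all tensor shapes are assumptions for demonstration only.

```python
# Minimal sketch (assumed names and shapes, not the paper's API) of:
#   (1) KV-sharing across two prompt segments inside full attention, and
#   (2) latent cross-fading over an overlap window for smooth transitions.
import torch
import torch.nn.functional as F

def shared_kv_attention(q_new, k_new, v_new, k_ref, v_ref):
    """Attention for the new segment, with keys/values from a reference
    segment concatenated in, so the new prompt 'sees' prior content."""
    k = torch.cat([k_ref, k_new], dim=2)  # (B, heads, T_ref + T_new, D)
    v = torch.cat([v_ref, v_new], dim=2)
    return F.scaled_dot_product_attention(q_new, k, v)

def blend_overlap(latents_a, latents_b, overlap):
    """Linearly cross-fade the last `overlap` frames of segment A into
    the first `overlap` frames of segment B, then concatenate."""
    w = torch.linspace(0.0, 1.0, overlap).view(1, overlap, 1, 1, 1)
    mixed = (1 - w) * latents_a[:, -overlap:] + w * latents_b[:, :overlap]
    return torch.cat(
        [latents_a[:, :-overlap], mixed, latents_b[:, overlap:]], dim=1
    )

# Toy shapes: batch 1, 8 heads, 16 frames x 64 spatial tokens, head dim 64.
B, H, T, D = 1, 8, 16 * 64, 64
q_b, k_b, v_b = (torch.randn(B, H, T, D) for _ in range(3))
k_a, v_a = torch.randn(B, H, T, D), torch.randn(B, H, T, D)
out = shared_kv_attention(q_b, k_b, v_b, k_a, v_a)  # (1, 8, 1024, 64)

lat_a = torch.randn(1, 16, 4, 32, 32)  # (B, frames, C, H, W) video latents
lat_b = torch.randn(1, 16, 4, 32, 32)
video = blend_overlap(lat_a, lat_b, overlap=4)  # 28 frames total
```

Concatenating the reference segment's keys/values lets the new prompt's queries attend to the previous segment's content, which is the rough intuition behind attention sharing; the paper's method additionally applies mask guidance for per-region semantic control, which this sketch omits.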
