DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation
December 24, 2024
Authors: Minghong Cai, Xiaodong Cun, Xiaoyu Li, Wenze Liu, Zhaoyang Zhang, Yong Zhang, Ying Shan, Xiangyu Yue
cs.AI
Abstract
Sora-like video generation models have achieved remarkable progress with the
Multi-Modal Diffusion Transformer (MM-DiT) architecture. However, current
video generation models focus predominantly on single prompts, struggling to
generate coherent scenes with multiple sequential prompts that better reflect
real-world dynamic scenarios. While some pioneering works have explored
multi-prompt video generation, they face significant challenges, including
strict training data requirements, weak prompt following, and unnatural
transitions. To address these problems, we propose DiTCtrl, the first
training-free multi-prompt video generation method under the MM-DiT
architecture. Our key idea is to treat the multi-prompt video generation task as
temporal video editing with smooth transitions. To achieve this goal, we first
analyze MM-DiT's attention mechanism, finding that the 3D full attention
behaves similarly to the cross/self-attention blocks in UNet-like
diffusion models, enabling precise, mask-guided semantic control across
different prompts via attention sharing for multi-prompt video generation.
Based on our careful design, videos generated by DiTCtrl achieve smooth
transitions and consistent object motion across multiple sequential prompts
without additional training. In addition, we present MPVBench, a new
benchmark specifically designed to evaluate multi-prompt video generation.
Extensive experiments demonstrate that
our method achieves state-of-the-art performance without additional training.
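To make the attention-sharing idea concrete, here is a minimal PyTorch sketch of mask-guided attention sharing between two prompt segments in a 3D full-attention block. The function name `shared_full_attention`, the tensor shapes, the cached reference keys/values, and the fusion weight `alpha` are illustrative assumptions for exposition, not the authors' actual implementation.

```python
# Minimal sketch of mask-guided attention sharing across two prompt
# segments. All names, shapes, and the fusion weight `alpha` are
# illustrative assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F

def shared_full_attention(q, k_cur, v_cur, k_ref, v_ref, fg_mask, alpha=0.5):
    """3D full attention where the current segment's queries also attend
    to keys/values cached from the previous prompt's denoising pass.

    q, k_cur, v_cur: (B, H, N, D)  current prompt segment
    k_ref, v_ref:    (B, H, N, D)  cached reference (previous prompt) segment
    fg_mask:         (B, N) in [0, 1], soft mask marking tokens (e.g. the
                     main object) that should stay consistent across prompts
    """
    # Attention against the current segment only.
    out_cur = F.scaled_dot_product_attention(q, k_cur, v_cur)
    # Attention against the cached reference segment (KV sharing), which
    # pulls appearance/motion cues from already-generated content.
    out_ref = F.scaled_dot_product_attention(q, k_ref, v_ref)
    # Mask-guided fusion: reference attention dominates on masked tokens,
    # current-prompt attention elsewhere.
    w = alpha * fg_mask.unsqueeze(1).unsqueeze(-1)  # (B, 1, N, 1)
    return (1 - w) * out_cur + w * out_ref
```

Sharing keys/values cached from the previous prompt's pass lets the new prompt's queries retrieve appearance and motion cues from already-generated content, while the soft mask confines that influence to the regions that should remain consistent across prompts.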
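Similarly, the smooth transitions between consecutive prompts can be pictured as blending segment latents over an overlap window. The linear crossfade below is one common way to realize such transitions; the overlap length and linear schedule are assumptions here, not the paper's stated mechanism.

```python
# Minimal sketch of a smooth transition between two prompt segments via
# linear latent blending over an overlap window (illustrative assumption).
import torch

def blend_overlap(latents_a, latents_b, overlap):
    """Crossfade two segment latents of shape (F, C, H, W) along time.

    latents_a: latents for the segment driven by the first prompt
    latents_b: latents for the segment driven by the second prompt
    overlap:   number of frames shared by both segments
    """
    # Linearly ramp the weight of segment B from 0 to 1 over the overlap.
    w = torch.linspace(0.0, 1.0, overlap).view(-1, 1, 1, 1)
    blended = (1 - w) * latents_a[-overlap:] + w * latents_b[:overlap]
    # Stitch together: A's head, blended overlap, B's tail.
    return torch.cat([latents_a[:-overlap], blended, latents_b[overlap:]], dim=0)
```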