DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation

Abstract

Sora-like video generation models have achieved remarkable progress with the Multi-Modal Diffusion Transformer (MM-DiT) architecture. However, current video generation models predominantly focus on a single prompt and struggle to generate coherent scenes from multiple sequential prompts, which better reflect real-world dynamic scenarios. While some pioneering works have explored multi-prompt video generation, they face significant challenges, including strict training data requirements, weak prompt following, and unnatural transitions. To address these problems, we propose DiTCtrl, the first training-free multi-prompt video generation method under MM-DiT architectures. Our key idea is to treat multi-prompt video generation as temporal video editing with smooth transitions. To achieve this goal, we first analyze MM-DiT's attention mechanism and find that its 3D full attention behaves similarly to the cross/self-attention blocks in UNet-like diffusion models. This enables mask-guided, precise semantic control across different prompts, with attention sharing for multi-prompt video generation. Based on our careful design, videos generated by DiTCtrl achieve smooth transitions and consistent object motion given multiple sequential prompts, without additional training. We also present MPVBench, a new benchmark specially designed to evaluate multi-prompt video generation. Extensive experiments demonstrate that our method achieves state-of-the-art performance without additional training.
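To make the attention analysis above concrete, the following toy sketch shows one way the idea could look in code. It is not the DiTCtrl implementation. It assumes only that text and video tokens are concatenated and attended jointly, so the attention map splits into cross-attention-like and self-attention-like blocks. All function names, the toy vectors, the threshold, and the cached keys/values standing in for an earlier prompt are hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Row-wise softmax of scaled dot products, one row per query token."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
            for q in queries]

def attend(queries, keys, values):
    """Attention output: each query's weighted average of the values."""
    dim = len(values[0])
    return [[sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
            for row in attention_weights(queries, keys)]

def split_joint_attention(weights, n_text):
    """Slice a joint [text; video] attention map into its four blocks."""
    text_rows, video_rows = weights[:n_text], weights[n_text:]
    return {
        "text_to_text": [r[:n_text] for r in text_rows],
        "text_to_video": [r[n_text:] for r in text_rows],
        "video_to_text": [r[:n_text] for r in video_rows],   # cross-attention-like
        "video_to_video": [r[n_text:] for r in video_rows],  # self-attention-like
    }

def token_mask(video_to_text, token_index, threshold):
    """Flag video tokens that attend strongly to one chosen text token."""
    return [row[token_index] > threshold for row in video_to_text]

def fuse_by_mask(own_out, shared_out, mask):
    """Per video token, use the shared-key/value output where the mask is set."""
    return [s if m else o for o, s, m in zip(own_out, shared_out, mask)]

# Toy tokens (hypothetical): 2 text tokens and 3 video tokens of dimension 2.
text = [[1.0, 0.0], [0.0, 1.0]]
video = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
tokens = text + video

weights = attention_weights(tokens, tokens)
blocks = split_joint_attention(weights, n_text=len(text))
mask = token_mask(blocks["video_to_text"], token_index=0, threshold=0.1)

# Hypothetical keys/values cached from an earlier prompt's pass (A),
# alongside the current prompt's own keys/values (B).
keys_a, values_a = [[1.0, 0.0], [0.0, 1.0]], [[1.0], [0.0]]
keys_b, values_b = video, [[0.0], [1.0], [0.5]]

own = attend(video, keys_b, values_b)
shared = attend(video, keys_a, values_a)
fused = fuse_by_mask(own, shared, mask)
```

Here the video-to-text block plays the role that cross-attention maps play in UNet-based editing methods: thresholding one text token's column yields a rough spatial mask. The fusion step is only a stand-in for attention sharing across prompts. How the mask and the shared keys/values are actually combined in DiTCtrl is not specified in the abstract.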
