Mask²DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation
Abstract
Sora has unveiled the immense potential of the Diffusion Transformer (DiT) architecture in single-scene video generation. However, the more challenging task of multi-scene video generation, which offers broader applications, remains relatively underexplored. To bridge this gap, we propose Mask²DiT, a novel approach that establishes fine-grained, one-to-one alignment between video segments and their corresponding text annotations. Specifically, we introduce a symmetric binary mask at each attention layer within the DiT architecture, ensuring that each text annotation applies exclusively to its respective video segment while preserving temporal coherence across visual tokens. This attention mechanism enables precise segment-level textual-to-visual alignment, allowing the DiT architecture to effectively handle video generation tasks with a fixed number of scenes. To further equip the DiT architecture with the ability to generate additional scenes based on existing ones, we incorporate a segment-level conditional mask, which conditions each newly generated segment on the preceding video segments, thereby enabling auto-regressive scene extension. Both qualitative and quantitative experiments confirm that Mask²DiT excels in maintaining visual consistency across segments while ensuring semantic alignment between each segment and its corresponding text description. Our project page is https://tianhao-qi.github.io/Mask2DiTProject.
AI-Generated Summary
Paper Overview
Core Contribution
- Introduces Mask²DiT, a novel approach for multi-scene long video generation using a dual-mask-based Diffusion Transformer (DiT) architecture.
- Establishes fine-grained, one-to-one alignment between video segments and their corresponding text annotations.
- Incorporates a symmetric binary mask at each attention layer to ensure precise segment-level textual-to-visual alignment.
- Introduces a segment-level conditional mask for auto-regressive scene extension, enabling the generation of additional scenes based on existing ones.
Research Context
- Builds on the success of the Diffusion Transformer (DiT) architecture in single-scene video generation, as demonstrated by Sora.
- Addresses the underexplored challenge of multi-scene video generation, which has broader applications in film production, educational content, and virtual experiences.
- Extends prior work on U-Net-based multi-scene video generation by leveraging the scalability and modeling capacity of the DiT architecture.
Keywords
- Multi-scene video generation
- Diffusion Transformer (DiT)
- Text-to-video (T2V) models
- Attention mechanisms
- Auto-regressive scene extension
Background
Research Gap
- Existing approaches to multi-scene video generation primarily rely on U-Net-based architectures, which struggle to capture long-range temporal dependencies and often produce visual discontinuities across scenes.
- Limited exploration of DiT-based architectures for multi-scene video generation, despite their scalability and superior performance in single-scene tasks.
Technical Challenges
- Ensuring fine-grained alignment between text annotations and corresponding video segments.
- Maintaining temporal coherence and visual consistency across multiple scenes.
- Scaling the model to handle longer videos with a fixed number of scenes and enabling auto-regressive scene extension.
Prior Approaches
- U-Net-based methods: Use multiple prompts to generate distinct scenes, often resulting in visual discontinuities.
- Training-free and fine-tuning-based techniques: Improve inter-segment temporal coherence but are constrained by the limited scalability of U-Net backbones.
- Keyframe-based approaches: Synthesize coherent keyframes but fail to account for temporal positioning and motion dynamics.
Methodology
Technical Architecture
- Built on the open-sourced CogVideoX model, which encodes input videos into a one-dimensional visual token sequence using a 3D Causal VAE.
- Introduces a symmetric binary mask at each attention layer to enforce one-to-one alignment between text annotations and video segments (a minimal mask-construction sketch follows this list).
- Implements a grouped attention mechanism to reduce memory usage and computational overhead.
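The bullet above describes the symmetric binary mask only in words; the PyTorch sketch below shows one way such a mask could be constructed. The interleaved [text_i, video_i] token layout, the segment lengths, and the function name build_dual_alignment_mask are assumptions made for illustration, not the paper's released implementation.

```python
# Minimal sketch of a symmetric binary attention mask for N scenes.
# Assumption: the unified sequence is laid out as [text_1, video_1, ..., text_N, video_N],
# with t_len text tokens and v_len visual tokens per scene; True = attention allowed.
import torch

def build_dual_alignment_mask(num_scenes: int, t_len: int, v_len: int) -> torch.Tensor:
    seg = t_len + v_len                        # tokens per scene (text + video)
    total = num_scenes * seg
    mask = torch.zeros(total, total, dtype=torch.bool)

    def text_slice(i):                         # text tokens of scene i
        return slice(i * seg, i * seg + t_len)

    def video_slice(i):                        # visual tokens of scene i
        return slice(i * seg + t_len, (i + 1) * seg)

    for i in range(num_scenes):
        # Each text annotation attends only to itself and its own video segment.
        mask[text_slice(i), text_slice(i)] = True
        mask[text_slice(i), video_slice(i)] = True
        mask[video_slice(i), text_slice(i)] = True   # keep the mask symmetric
        for j in range(num_scenes):
            # Visual tokens attend to visual tokens of every scene,
            # which preserves temporal coherence across segments.
            mask[video_slice(i), video_slice(j)] = True
    return mask

# Example: 3 scenes, 4 text tokens and 16 visual tokens per scene.
attn_mask = build_dual_alignment_mask(num_scenes=3, t_len=4, v_len=16)
```

A boolean matrix of this form can, for instance, be supplied as the attn_mask argument of torch.nn.functional.scaled_dot_product_attention (where True marks allowed positions) at every attention layer.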
Implementation Details
- Concatenates text and video token sequences of multiple scenes into a unified one-dimensional sequence.
- Uses a segment-level conditional mask to condition the generation of new scenes on preceding segments (see the sketch after this list).
- Combines pre-training on non-contiguous video segments with supervised fine-tuning on coherent multi-scene videos.
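To illustrate the conditional-mask idea referenced in the list above, here is a hedged sketch in which only the latent tokens of the newly appended scene are noised and supervised, while tokens of the preceding scenes stay clean and act as context. The tensor layout, the function names, and the epsilon-prediction objective are assumptions made for this example, not the paper's exact conditioning scheme.

```python
# Hedged sketch: a segment-level conditional mask gates the diffusion objective
# so only the last (newly appended) scene is denoised; earlier scenes stay clean.
import torch

def conditional_noising(latents, noise, alpha_bar_t, num_scenes, v_len):
    """latents, noise: [B, num_scenes * v_len, C]; alpha_bar_t: scalar tensor."""
    # 1 for tokens of the new (last) scene, 0 for the conditioning scenes.
    cond_mask = torch.zeros(num_scenes * v_len, device=latents.device, dtype=latents.dtype)
    cond_mask[(num_scenes - 1) * v_len:] = 1.0
    cond_mask = cond_mask.view(1, -1, 1)

    # Standard forward-diffusion noising, applied only to the new segment,
    # so the preceding segments remain clean conditioning signals.
    noisy = alpha_bar_t.sqrt() * latents + (1.0 - alpha_bar_t).sqrt() * noise
    model_input = cond_mask * noisy + (1.0 - cond_mask) * latents
    return model_input, cond_mask

def masked_diffusion_loss(pred_noise, noise, cond_mask):
    # Supervise only the tokens that are actually being generated.
    sq_err = (pred_noise - noise) ** 2 * cond_mask
    return sq_err.sum() / cond_mask.expand_as(pred_noise).sum()
```

Under such a scheme, the same mask decides at inference time which tokens are re-noised at each step, so a video can be extended scene by scene in an auto-regressive loop.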
Innovation Points
- Symmetric binary mask: Ensures precise alignment between text annotations and video segments while preserving temporal coherence.
- Segment-level conditional mask: Enables auto-regressive scene extension by conditioning new segments on preceding ones.
- Pre-training strategy: Reduces reliance on large-scale consecutive video data by training on non-contiguous segments (see the data-construction sketch below).
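The pre-training strategy above (and the Panda70M setup described under Results) can be pictured with a small data-construction sketch: unrelated clips are sampled from a pool and concatenated into a pseudo multi-scene sample, each keeping its own caption. The Clip structure and the sampling policy are hypothetical, introduced only for illustration.

```python
# Illustrative sketch of multi-scene pre-training sample construction from
# non-contiguous clips; the Clip dataclass and sampling policy are assumptions.
import random
from dataclasses import dataclass

@dataclass
class Clip:
    latent_path: str   # where the clip's encoded (VAE) latents are stored
    caption: str       # the clip's own text annotation

def build_pretraining_sample(clip_pool, num_scenes=3):
    """Concatenate non-contiguous clips into one pseudo multi-scene sample."""
    scenes = random.sample(clip_pool, k=num_scenes)
    videos = [c.latent_path for c in scenes]   # later concatenated into one visual token sequence
    prompts = [c.caption for c in scenes]      # one annotation per segment (for the alignment mask)
    return videos, prompts
```

Because each segment only has to match its own caption, such samples need not come from a single continuous long video, which is what reduces the dependence on scarce, coherently annotated multi-scene footage; the supervised fine-tuning stage then restores cross-scene coherence.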
Results
Experimental Setup
- Pre-training dataset: 1 million video samples from Panda70M, concatenated into multi-scene videos.
- Supervised fine-tuning dataset: Constructed from long-form videos segmented into coherent multi-scene combinations.
- Evaluation dataset: 50 scenes with diverse prompts generated using ChatGPT.
Key Findings
- Mask²DiT achieves 70.95% visual consistency and 47.45% sequence consistency, outperforming state-of-the-art methods on both metrics.
- The model excels in maintaining character consistency, background coherence, and stylistic uniformity across scenes.
- Auto-regressive scene extension capability allows for seamless generation of additional scenes with high fidelity.
Limitations
- Limited to generating animated videos due to training data constraints.
- Motion dynamics and scene durations require further investigation for broader applicability.