Mask²DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation
Abstract
Sora has unveiled the immense potential of the Diffusion Transformer (DiT) architecture in single-scene video generation. However, the more challenging task of multi-scene video generation, which offers broader applications, remains relatively underexplored. To bridge this gap, we propose Mask²DiT, a novel approach that establishes fine-grained, one-to-one alignment between video segments and their corresponding text annotations. Specifically, we introduce a symmetric binary mask at each attention layer within the DiT architecture, ensuring that each text annotation applies exclusively to its respective video segment while preserving temporal coherence across visual tokens. This attention mechanism enables precise segment-level textual-to-visual alignment, allowing the DiT architecture to effectively handle video generation tasks with a fixed number of scenes. To further equip the DiT architecture with the ability to generate additional scenes based on existing ones, we incorporate a segment-level conditional mask, which conditions each newly generated segment on the preceding video segments, thereby enabling auto-regressive scene extension. Both qualitative and quantitative experiments confirm that Mask²DiT excels in maintaining visual consistency across segments while ensuring semantic alignment between each segment and its corresponding text description. Our project page is https://tianhao-qi.github.io/Mask2DiTProject.
AI-Generated Summary
Paper Overview
Core Contribution
- Introduces Mask²DiT, a novel approach for multi-scene long video generation using a dual-mask-based Diffusion Transformer (DiT) architecture.
- Establishes fine-grained, one-to-one alignment between video segments and their corresponding text annotations.
- Incorporates a symmetric binary mask at each attention layer to ensure precise segment-level textual-to-visual alignment.
- Introduces a segment-level conditional mask for auto-regressive scene extension, enabling the generation of additional scenes based on existing ones.
Research Context
- Builds on the success of the Diffusion Transformer (DiT) architecture in single-scene video generation, as demonstrated by Sora.
- Addresses the underexplored challenge of multi-scene video generation, which has broader applications in film production, educational content, and virtual experiences.
- Extends prior work on U-Net-based multi-scene video generation by leveraging the scalability and modeling capacity of the DiT architecture.
Keywords
- Multi-scene video generation
- Diffusion Transformer (DiT)
- Text-to-video (T2V) models
- Attention mechanisms
- Auto-regressive scene extension
Background
Research Gap
- Existing approaches to multi-scene video generation primarily rely on U-Net-based architectures, which struggle to capture long-range temporal dependencies and often produce visual discontinuities across scenes.
- Limited exploration of DiT-based architectures for multi-scene video generation, despite their scalability and superior performance in single-scene tasks.
Technical Challenges
- Ensuring fine-grained alignment between text annotations and corresponding video segments.
- Maintaining temporal coherence and visual consistency across multiple scenes.
- Scaling the model to handle longer videos with a fixed number of scenes and enabling auto-regressive scene extension.
Prior Approaches
- U-Net-based methods: Use multiple prompts to generate distinct scenes, often resulting in visual discontinuities.
- Training-free and fine-tuning-based techniques: Improve inter-segment temporal coherence but are constrained by the limited scalability of U-Net backbones.
- Keyframe-based approaches: Synthesize coherent keyframes but fail to account for temporal positioning and motion dynamics.
Methodology
Technical Architecture
- Built on the open-sourced CogVideoX model, which encodes input videos into a one-dimensional visual token sequence using a 3D Causal VAE.
- Introduces a symmetric binary mask at each attention layer to enforce one-to-one alignment between text annotations and video segments (a minimal mask-construction sketch follows this list).
- Implements a grouped attention mechanism to reduce memory usage and computational overhead.
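The bullet above describes the symmetric binary mask only in words; the PyTorch sketch below shows one way such a mask could be constructed. The interleaved [text_i, video_i] token layout, the segment lengths, and the function name build_dual_alignment_mask are assumptions made for illustration, not the paper's released implementation.

```python
# Minimal sketch of a symmetric binary attention mask for N scenes.
# Assumption: the unified sequence is laid out as [text_1, video_1, ..., text_N, video_N],
# with t_len text tokens and v_len visual tokens per scene; True = attention allowed.
import torch

def build_dual_alignment_mask(num_scenes: int, t_len: int, v_len: int) -> torch.Tensor:
    seg = t_len + v_len                        # tokens per scene (text + video)
    total = num_scenes * seg
    mask = torch.zeros(total, total, dtype=torch.bool)

    def text_slice(i):                         # text tokens of scene i
        return slice(i * seg, i * seg + t_len)

    def video_slice(i):                        # visual tokens of scene i
        return slice(i * seg + t_len, (i + 1) * seg)

    for i in range(num_scenes):
        # Each text annotation attends only to itself and its own video segment.
        mask[text_slice(i), text_slice(i)] = True
        mask[text_slice(i), video_slice(i)] = True
        mask[video_slice(i), text_slice(i)] = True   # keep the mask symmetric
        for j in range(num_scenes):
            # Visual tokens attend to visual tokens of every scene,
            # which preserves temporal coherence across segments.
            mask[video_slice(i), video_slice(j)] = True
    return mask

# Example: 3 scenes, 4 text tokens and 16 visual tokens per scene.
attn_mask = build_dual_alignment_mask(num_scenes=3, t_len=4, v_len=16)
```

A boolean matrix of this form can, for instance, be supplied as the attn_mask argument of torch.nn.functional.scaled_dot_product_attention (where True marks allowed positions) at every attention layer.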
Implementation Details
- Concatenates text and video token sequences of multiple scenes into a unified one-dimensional sequence.
- Uses a segment-level conditional mask to condition the generation of new scenes on preceding segments (see the sketch after this list).
- Combines pre-training on non-contiguous video segments with supervised fine-tuning on coherent multi-scene videos.
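To illustrate the conditional-mask idea referenced in the list above, here is a hedged sketch in which only the latent tokens of the newly appended scene are noised and supervised, while tokens of the preceding scenes stay clean and act as context. The tensor layout, the function names, and the epsilon-prediction objective are assumptions made for this example, not the paper's exact conditioning scheme.

```python
# Hedged sketch: a segment-level conditional mask gates the diffusion objective
# so only the last (newly appended) scene is denoised; earlier scenes stay clean.
import torch

def conditional_noising(latents, noise, alpha_bar_t, num_scenes, v_len):
    """latents, noise: [B, num_scenes * v_len, C]; alpha_bar_t: scalar tensor."""
    # 1 for tokens of the new (last) scene, 0 for the conditioning scenes.
    cond_mask = torch.zeros(num_scenes * v_len, device=latents.device, dtype=latents.dtype)
    cond_mask[(num_scenes - 1) * v_len:] = 1.0
    cond_mask = cond_mask.view(1, -1, 1)

    # Standard forward-diffusion noising, applied only to the new segment,
    # so the preceding segments remain clean conditioning signals.
    noisy = alpha_bar_t.sqrt() * latents + (1.0 - alpha_bar_t).sqrt() * noise
    model_input = cond_mask * noisy + (1.0 - cond_mask) * latents
    return model_input, cond_mask

def masked_diffusion_loss(pred_noise, noise, cond_mask):
    # Supervise only the tokens that are actually being generated.
    sq_err = (pred_noise - noise) ** 2 * cond_mask
    return sq_err.sum() / cond_mask.expand_as(pred_noise).sum()
```

Under such a scheme, the same mask decides at inference time which tokens are re-noised at each step, so a video can be extended scene by scene in an auto-regressive loop.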
Innovation Points
- Symmetric binary mask: Ensures precise alignment between text annotations and video segments while preserving temporal coherence.
- Segment-level conditional mask: Enables auto-regressive scene extension by conditioning new segments on preceding ones.
- Pre-training strategy: Reduces reliance on large-scale consecutive video data by training on non-contiguous segments (see the data-construction sketch below).
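The pre-training strategy above (and the Panda70M setup described under Results) can be pictured with a small data-construction sketch: unrelated clips are sampled from a pool and concatenated into a pseudo multi-scene sample, each keeping its own caption. The Clip structure and the sampling policy are hypothetical, introduced only for illustration.

```python
# Illustrative sketch of multi-scene pre-training sample construction from
# non-contiguous clips; the Clip dataclass and sampling policy are assumptions.
import random
from dataclasses import dataclass

@dataclass
class Clip:
    latent_path: str   # where the clip's encoded (VAE) latents are stored
    caption: str       # the clip's own text annotation

def build_pretraining_sample(clip_pool, num_scenes=3):
    """Concatenate non-contiguous clips into one pseudo multi-scene sample."""
    scenes = random.sample(clip_pool, k=num_scenes)
    videos = [c.latent_path for c in scenes]   # later concatenated into one visual token sequence
    prompts = [c.caption for c in scenes]      # one annotation per segment (for the alignment mask)
    return videos, prompts
```

Because each segment only has to match its own caption, such samples need not come from a single continuous long video, which is what reduces the dependence on scarce, coherently annotated multi-scene footage; the supervised fine-tuning stage then restores cross-scene coherence.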
Results
Experimental Setup
- Pre-training dataset: 1 million video samples from Panda70M, concatenated into multi-scene videos.
- Supervised fine-tuning dataset: Constructed from long-form videos segmented into coherent multi-scene combinations.
- Evaluation dataset: 50 scenes with diverse prompts generated using ChatGPT.
Key Findings
- Mask²DiT achieves 70.95% visual consistency and 47.45% sequence consistency, outperforming state-of-the-art methods on both metrics.
- The model excels in maintaining character consistency, background coherence, and stylistic uniformity across scenes.
- Auto-regressive scene extension capability allows for seamless generation of additional scenes with high fidelity.
Limitations
- Limited to generating animated videos due to training data constraints.
- Motion dynamics and scene durations require further investigation for broader applicability.