Efficient-vDiT: Efficient Video Diffusion Transformers With Attention Tile
February 10, 2025
Authors: Hangliang Ding, Dacheng Li, Runlong Su, Peiyuan Zhang, Zhijie Deng, Ion Stoica, Hao Zhang
cs.AI
Abstract
Despite the promise of synthesizing high-fidelity videos, Diffusion
Transformers (DiTs) with 3D full attention suffer from expensive inference due
to the complexity of attention computation and numerous sampling steps. For
example, the popular Open-Sora-Plan model takes more than 9 minutes to generate a
single 29-frame video. This paper addresses the inefficiency
issue from two aspects: 1) Prune the 3D full attention based on the redundancy
within video data. We identify a prevalent tile-style repetitive pattern in the
3D attention maps for video data, and advocate a new family of sparse 3D
attention that holds a linear complexity w.r.t. the number of video frames. 2)
Shorten the sampling process by adopting existing multi-step consistency
distillation. We split the entire sampling trajectory into several segments and
perform consistency distillation within each one to activate few-step
generation capacities. We further devise a three-stage training pipeline to
conjoin the low-complexity attention and few-step generation capacities.
Notably, with 0.1% of the pretraining data, we turn the Open-Sora-Plan-1.2 model
into an efficient one that is 7.4x-7.8x faster for 29- and 93-frame 720p video
generation, with a marginal performance trade-off on VBench. In addition, we
demonstrate that our approach is amenable to distributed inference, achieving
an additional 3.91x speedup when running on 4 GPUs with sequence parallelism.
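The abstract only characterizes the sparse attention at a high level, so the following is a minimal, hypothetical PyTorch sketch of the idea rather than the authors' implementation: tokens are grouped by frame, and each frame attends to its own tokens plus a small, fixed set of globally visible frames, so the amount of attended content grows linearly with the number of frames. The function and parameter names (`tile_sparse_mask`, `tokens_per_frame`, `global_frames`) are illustrative assumptions, and a dense boolean mask is used only for clarity; a real implementation would rely on a block-sparse attention kernel to realize the reported speedup.

```python
# Hypothetical sketch of a tile-style, frame-level sparse attention mask.
# Not the authors' released code; assumes tokens are ordered frame by frame.
import torch
import torch.nn.functional as F

def tile_sparse_mask(num_frames: int, tokens_per_frame: int,
                     global_frames=(0,)) -> torch.Tensor:
    """Boolean (L, L) attention mask with L = num_frames * tokens_per_frame.
    True entries are kept; False entries are masked out."""
    L = num_frames * tokens_per_frame
    frame_id = torch.arange(L) // tokens_per_frame         # frame index of each token
    same_frame = frame_id[:, None] == frame_id[None, :]    # diagonal tiles (intra-frame attention)
    global_key = torch.zeros(L, dtype=torch.bool)
    for g in global_frames:                                 # constant-size set of reference frames
        global_key |= frame_id == g
    return same_frame | global_key[None, :]                 # every query also sees the global frames

# Usage (dense fallback for illustration; a block-sparse kernel would give the actual speedup):
num_frames, tokens_per_frame, heads, head_dim = 4, 16, 8, 64
L = num_frames * tokens_per_frame
q = k = v = torch.randn(1, heads, L, head_dim)
mask = tile_sparse_mask(num_frames, tokens_per_frame)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)  # (1, heads, L, head_dim)
```

Because each query row attends only to its own frame plus a constant number of global frames, the number of attended key tokens per frame stays fixed as the frame count grows, which is what makes the overall attention cost linear in the number of frames under this assumed pattern.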