Efficient-vDiT: Efficient Video Diffusion Transformers With Attention Tile

Abstract

Diffusion Transformers (DiTs) with 3D full attention promise high-fidelity video synthesis, but inference is expensive because of the complexity of attention computation and the large number of sampling steps. For example, the popular Open-Sora-Plan model takes more than 9 minutes to generate a single 29-frame video. This paper addresses the inefficiency from two aspects. 1) We prune the 3D full attention based on the redundancy within video data: we identify a prevalent tile-style repetitive pattern in the 3D attention maps for video data and advocate a new family of sparse 3D attention whose complexity is linear in the number of video frames. 2) We shorten the sampling process by adopting existing multi-step consistency distillation: we split the entire sampling trajectory into several segments and perform consistency distillation within each one to enable few-step generation. We further devise a three-stage training pipeline that combines the low-complexity attention with few-step generation. Notably, with 0.1% of the pretraining data, we turn the Open-Sora-Plan-1.2 model into an efficient one that is 7.4x to 7.8x faster for 29-frame and 93-frame 720p video generation, with a marginal performance trade-off on VBench. In addition, we demonstrate that our approach is amenable to distributed inference, achieving an additional 3.91x speedup when running on 4 GPUs with sequence parallelism.
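The claim that sparse 3D attention can scale linearly with the number of frames can be illustrated with a small counting sketch. The abstract does not specify the mask, so the rule used here is an assumption made purely for illustration: each frame attends to a fixed number of nearby frames, and every frame-to-frame link expands into a dense tile of token pairs. The names `tile_mask`, `attended_pairs`, `tokens_per_frame`, and `frames_attended` are hypothetical and do not come from the paper.

```python
import numpy as np


def tile_mask(num_frames: int, tokens_per_frame: int, frames_attended: int) -> np.ndarray:
    """Build a boolean attention mask over num_frames * tokens_per_frame tokens.

    Assumed rule (not taken from the paper): each query frame attends to a
    window of `frames_attended` frames that contains the query frame itself.
    """
    if frames_attended < 1:
        raise ValueError("frames_attended must be at least 1")
    k = min(frames_attended, num_frames)

    frame_mask = np.zeros((num_frames, num_frames), dtype=bool)
    for q in range(num_frames):
        # Centre the window on frame q, then clamp it inside [0, num_frames).
        start = min(max(q - k // 2, 0), num_frames - k)
        frame_mask[q, start:start + k] = True

    # Expand every frame-level entry into a tokens_per_frame x tokens_per_frame tile.
    token_mask = np.repeat(frame_mask, tokens_per_frame, axis=0)
    return np.repeat(token_mask, tokens_per_frame, axis=1)


def attended_pairs(mask: np.ndarray) -> int:
    """Count the query-key pairs that the mask keeps."""
    return int(mask.sum())


if __name__ == "__main__":
    tokens, window = 4, 2
    for frames in (8, 16, 32):
        sparse = attended_pairs(tile_mask(frames, tokens, window))
        full = (frames * tokens) ** 2  # 3D full attention keeps every pair
        print(f"frames={frames} sparse_pairs={sparse} full_pairs={full}")
```

Under this assumed rule the kept pairs number `num_frames * frames_attended * tokens_per_frame**2`, so doubling the frame count doubles the sparse cost, while the full-attention cost quadruples.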
