CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up

Abstract

Diffusion Transformers (DiTs) have become a leading architecture for image generation. However, the quadratic complexity of attention, which models token-wise relationships, causes significant latency when generating high-resolution images. To address this issue, we seek a linear attention mechanism that reduces the complexity of pre-trained DiTs to linear. We begin with a comprehensive summary of existing efficient attention mechanisms and identify four factors crucial for successfully linearizing pre-trained DiTs: locality, formulation consistency, high-rank attention maps, and feature integrity. Based on these insights, we introduce CLEAR, a convolution-like local attention strategy that limits feature interactions to a local window around each query token and thus achieves linear complexity. Our experiments indicate that fine-tuning the attention layers on merely 10K self-generated samples for 10K iterations effectively transfers knowledge from a pre-trained DiT to a student model with linear complexity, yielding results comparable to the teacher model. At the same time, CLEAR reduces attention computation by 99.5% and accelerates generation by 6.3 times when generating 8K-resolution images. Furthermore, we investigate favorable properties of the distilled attention layers, such as zero-shot generalization across various models and plugins and improved support for multi-GPU parallel inference. Models and code are available at https://github.com/Huage001/CLEAR.
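To make the idea of window-limited attention concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes tokens lie on an `h` x `w` grid and that each query attends only to keys within a Euclidean distance `radius` of its position; the circular window shape, the function names, and the parameter values are all illustrative assumptions rather than details taken from the abstract.

```python
import math


def local_window_indices(h, w, row, col, radius):
    """Flat indices of grid tokens closer than `radius` to (row, col).

    Assumption: a circular (Euclidean) window; the abstract only says
    "a local window around each query token".
    """
    r = math.ceil(radius)
    indices = []
    for i in range(max(0, row - r), min(h, row + r + 1)):
        for j in range(max(0, col - r), min(w, col + r + 1)):
            if (i - row) ** 2 + (j - col) ** 2 < radius ** 2:
                indices.append(i * w + j)
    return indices


def local_attention(q, k, v, h, w, radius):
    """Softmax attention where each query only sees keys in its window.

    q, k, v are lists of h*w feature vectors (lists of floats).
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    outputs = []
    for n in range(h * w):
        row, col = divmod(n, w)
        idx = local_window_indices(h, w, row, col, radius)
        scores = [scale * sum(a * b for a, b in zip(q[n], k[m])) for m in idx]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(wt * v[m][c] for wt, m in zip(weights, idx)) / total
            for c in range(len(v[0]))
        ])
    return outputs


def attended_pairs(h, w, radius):
    """Number of (query, key) pairs evaluated by the windowed attention."""
    count = 0
    for n in range(h * w):
        row, col = divmod(n, w)
        count += len(local_window_indices(h, w, row, col, radius))
    return count


if __name__ == "__main__":
    h = w = 16          # hypothetical token grid
    radius = 1.5        # hypothetical window radius
    full = (h * w) ** 2
    local = attended_pairs(h, w, radius)
    print(f"full pairs: {full}, local pairs: {local}")
```

Because the number of keys per query is bounded by the window size regardless of the grid size, the total number of query-key pairs grows linearly with the number of tokens, whereas full attention grows quadratically. The savings for any particular resolution depend on the window radius, which this sketch leaves as a free parameter.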
