CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up
December 20, 2024
Authors: Songhua Liu, Zhenxiong Tan, Xinchao Wang
cs.AI
Abstract
Diffusion Transformers (DiT) have become a leading architecture in image
generation. However, the quadratic complexity of attention mechanisms, which
are responsible for modeling token-wise relationships, results in significant
latency when generating high-resolution images. To address this issue, we aim
at a linear attention mechanism that reduces the complexity of
pre-trained DiTs to linear. We begin our exploration with a comprehensive
summary of existing efficient attention mechanisms and identify four key
factors crucial for successful linearization of pre-trained DiTs: locality,
formulation consistency, high-rank attention maps, and feature integrity. Based
on these insights, we introduce a convolution-like local attention strategy
termed CLEAR, which limits feature interactions to a local window around each
query token, and thus achieves linear complexity. Our experiments indicate
that, by fine-tuning the attention layer on merely 10K self-generated samples
for 10K iterations, we can effectively transfer knowledge from a pre-trained
DiT to a student model with linear complexity, yielding results comparable to
the teacher model. Simultaneously, it reduces attention computations by 99.5%
and accelerates the generation of 8K-resolution images by 6.3 times.
Furthermore, we investigate favorable properties in the distilled attention
layers, such as zero-shot generalization across various models and plugins, and
improved support for multi-GPU parallel inference. Models and codes are
available here: https://github.com/Huage001/CLEAR.
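
To illustrate the convolution-like local attention described in the abstract, below is a minimal PyTorch sketch in which each query token attends only to key/value tokens inside a fixed window on the 2D token grid. The function name `local_window_attention`, the square (Chebyshev-distance) window, and the dense-mask formulation are illustrative assumptions rather than the released CLEAR implementation; a practical linear-complexity version would gather only the neighbors inside each window (e.g., with block-sparse attention kernels) instead of materializing the full N x N score matrix.

```python
# Minimal sketch of convolution-like local attention in the spirit of CLEAR.
# Assumptions (not taken from the paper's code): a single head, a square window
# of radius `radius` on the 2D token grid, and a dense mask for clarity.
import torch
import torch.nn.functional as F


def local_window_attention(q, k, v, grid_h, grid_w, radius=4):
    """q, k, v: (batch, N, dim) tensors with N == grid_h * grid_w image tokens."""
    device = q.device
    ys, xs = torch.meshgrid(
        torch.arange(grid_h, device=device),
        torch.arange(grid_w, device=device),
        indexing="ij",
    )
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (N, 2)

    # Each query token only interacts with key/value tokens inside its local window.
    dist = (coords[:, None, :] - coords[None, :, :]).abs().amax(dim=-1)  # Chebyshev distance
    mask = dist <= radius  # (N, N) boolean local-window mask

    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # (batch, N, N)
    scores = scores.masked_fill(~mask, float("-inf"))      # drop out-of-window pairs
    return F.softmax(scores, dim=-1) @ v                   # (batch, N, dim)


if __name__ == "__main__":
    B, H, W, D = 1, 16, 16, 64
    q = torch.randn(B, H * W, D)
    k = torch.randn(B, H * W, D)
    v = torch.randn(B, H * W, D)
    out = local_window_attention(q, k, v, grid_h=H, grid_w=W, radius=4)
    print(out.shape)  # torch.Size([1, 256, 64])
```

Because the window size is fixed, the number of key/value tokens each query touches stays constant as the image resolution grows, which is what makes the overall cost scale linearly with the number of tokens.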