CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up
December 20, 2024
Authors: Songhua Liu, Zhenxiong Tan, Xinchao Wang
cs.AI
Abstract
Diffusion Transformers (DiT) have become a leading architecture in image
generation. However, the quadratic complexity of attention mechanisms, which
are responsible for modeling token-wise relationships, results in significant
latency when generating high-resolution images. To address this issue, we aim
at a linear attention mechanism in this paper that reduces the complexity of
pre-trained DiTs to linear. We begin our exploration with a comprehensive
summary of existing efficient attention mechanisms and identify four key
factors crucial for successful linearization of pre-trained DiTs: locality,
formulation consistency, high-rank attention maps, and feature integrity. Based
on these insights, we introduce a convolution-like local attention strategy
termed CLEAR, which limits feature interactions to a local window around each
query token, and thus achieves linear complexity. Our experiments indicate
that, by fine-tuning the attention layer on merely 10K self-generated samples
for 10K iterations, we can effectively transfer knowledge from a pre-trained
DiT to a student model with linear complexity, yielding results comparable to
the teacher model. Simultaneously, it reduces attention computations by 99.5%
and accelerates generation by 6.3 times for generating 8K-resolution images.
Furthermore, we investigate favorable properties in the distilled attention
layers, such as zero-shot generalization across various models and plugins, and
improved support for multi-GPU parallel inference. Models and codes are
available here: https://github.com/Huage001/CLEAR.
AI-Generated Summary
Paper Overview
Core Contributions
- Proposes CLEAR, a convolution-like local attention strategy that linearizes the attention mechanism of pre-trained Diffusion Transformers (DiT), substantially improving the efficiency of high-resolution image generation.
- Via knowledge distillation, only 10K fine-tuning iterations are needed to transfer the knowledge of a pre-trained DiT to a student model with linear complexity, whose outputs are comparable to the original model's.
- Reduces attention computation by 99.5% and accelerates generation by 6.3x when producing 8K-resolution images.
Research Background
- Diffusion Transformers (DiT) excel at image generation, but the quadratic complexity of their attention mechanism causes significant latency when generating high-resolution images.
- Existing efficient attention mechanisms perform poorly when applied to pre-trained DiTs and fail to linearize them effectively.
Keywords
- Diffusion Transformer (DiT)
- Linear attention
- Convolution-like local attention
- Knowledge distillation
- High-resolution image generation
Background
Research Gap
- Existing efficient attention mechanisms perform poorly when applied to pre-trained DiTs and fail to linearize them effectively.
- There is no linearization strategy tailored to pre-trained DiTs.
Technical Challenges
- How to linearize the attention mechanism of a pre-trained DiT while preserving generation quality.
- How to substantially reduce computational complexity and latency in high-resolution image generation.
Existing Approaches
- Linear attention: achieves linear complexity by omitting the softmax operation, but performs poorly on pre-trained DiTs (see the formulation sketch after this list).
- Key-value compression: reduces computation by compressing key-value pairs, but causes loss of detail.
- Key-value sampling: reduces computation by sampling key-value pairs, but requires local tokens to produce visually coherent results.
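For reference, standard softmax attention and the kernel-based linear attention that typically replaces it can be written as follows. This is the generic textbook formulation, not a reproduction of any specific method evaluated in the paper:

$$\mathrm{Attn}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V \;\Rightarrow\; O(N^{2}d), \qquad \mathrm{LinAttn}(Q,K,V)=\phi(Q)\bigl(\phi(K)^{\top}V\bigr) \;\Rightarrow\; O(Nd^{2}),$$

where $N$ is the number of tokens, $d$ the head dimension, and $\phi$ a kernel feature map. Computing $\phi(K)^{\top}V$ first avoids the $N \times N$ attention map, but it also drops the softmax and changes the formulation, which the paper identifies (formulation consistency) as one reason such methods transfer poorly to pre-trained DiTs.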
Methodology
Technical Architecture
- CLEAR adopts a convolution-like local attention strategy that restricts each query token to interacting only with tokens inside a local window, thereby achieving linear complexity (see the complexity sketch after this list).
- Knowledge distillation transfers the knowledge of the pre-trained DiT to the linear-complexity student model.
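As a rough complexity sketch, assume each query attends to a fixed local window of $w$ key/value tokens out of $N$ image tokens (the exact window shape and radius follow the paper):

$$\text{full attention: } O(N^{2}d) \quad\longrightarrow\quad \text{CLEAR local attention: } O(Nwd).$$

Because $w$ is a constant independent of resolution, the cost grows linearly with $N$; at 8K resolution the ratio $w/N$ becomes very small, consistent with the reported 99.5% reduction in attention computation.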
Implementation Details
- CLEAR is implemented efficiently in PyTorch with FlexAttention, which supports GPU-level optimization (a hedged usage sketch follows this list).
- Fine-tuning runs for 10K iterations on 10K self-generated samples, updating only the attention-layer parameters.
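The sketch below shows how a circular local-window mask could be expressed with PyTorch's FlexAttention API (torch.nn.attention.flex_attention, available in recent PyTorch releases). The grid size, window radius, and tensor shapes are illustrative assumptions, not the paper's exact configuration:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

# Illustrative settings (assumptions): a 64x64 grid of image tokens, window radius 8.
GRID = 64          # tokens per side of the latent feature map
RADIUS = 8         # local window radius, in token units
B, H, D = 1, 24, 128
N = GRID * GRID    # total number of image tokens

def local_window_mask(b, h, q_idx, kv_idx):
    # Recover 2D coordinates from flattened token indices.
    qy, qx = q_idx // GRID, q_idx % GRID
    ky, kx = kv_idx // GRID, kv_idx % GRID
    # Keep only key/value tokens within a Euclidean radius of the query token.
    return (qy - ky) ** 2 + (qx - kx) ** 2 <= RADIUS ** 2

# Precompute a sparse block mask; FlexAttention skips fully masked blocks.
block_mask = create_block_mask(local_window_mask, B=None, H=None, Q_LEN=N, KV_LEN=N)

q = torch.randn(B, H, N, D, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Each query attends only within its window, so attention cost scales linearly with N.
# In practice flex_attention is wrapped in torch.compile for speed.
out = flex_attention(q, k, v, block_mask=block_mask)
```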
Innovations
- Proposes a convolution-like local attention strategy, the first linearization tailored to pre-trained DiTs.
- Achieves efficient knowledge transfer via knowledge distillation, producing results comparable to the original model (a hedged training-step sketch follows this list).
- Supports multi-GPU parallel inference, substantially improving the efficiency of high-resolution image generation.
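Below is a minimal sketch of what a distillation-style fine-tuning step could look like, assuming a simple output-matching (MSE) objective between teacher and student predictions; the paper's actual loss combination and training pipeline may differ:

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, optimizer, latents, timesteps, text_emb):
    """One hypothetical fine-tuning step: only the student's attention layers are
    trainable; all other parameters are frozen and shared with the teacher."""
    with torch.no_grad():
        teacher_pred = teacher(latents, timesteps, text_emb)  # full-attention teacher
    student_pred = student(latents, timesteps, text_emb)      # CLEAR local-attention student
    loss = F.mse_loss(student_pred, teacher_pred)             # match the teacher's prediction
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    return loss.item()
```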
Results
Experimental Setup
- Main experiments are conducted on the FLUX model family, replacing all attention layers with CLEAR and fine-tuning on 10K self-generated samples.
- Quantitative evaluation is performed on the COCO2014 validation set using FID, LPIPS, CLIP image similarity, and related metrics.
Key Findings
- When generating 1024×1024 images, CLEAR performs comparably to, or even better than, the original FLUX.1-dev model.
- For 8K-resolution images, attention computation is reduced by 99.5% and generation is accelerated by 6.3x.
- CLEAR supports zero-shot generalization across resolutions and across models/plugins, and improves multi-GPU parallel inference.
Limitations
- In practice, CLEAR's speedup does not fully reach the theoretical expectation, and at lower resolutions it can even be slower than the original DiT.
- Hardware optimization of sparse attention is more challenging than that of full attention; specially optimized CUDA kernels are needed in future work.
Conclusion
- CLEAR successfully linearizes the attention mechanism of pre-trained DiTs with a convolution-like local attention strategy, substantially improving the efficiency of high-resolution image generation.
- Knowledge distillation enables efficient knowledge transfer with only light fine-tuning, yielding results comparable to the original model.
- CLEAR supports zero-shot generalization across resolutions and across models/plugins and improves multi-GPU parallel inference, suggesting broad applicability.