CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up

Abstract

Diffusion Transformers (DiT) have become a leading architecture in image generation. However, the quadratic complexity of attention mechanisms, which are responsible for modeling token-wise relationships, results in significant latency when generating high-resolution images. To address this issue, in this paper we aim for a linear attention mechanism that reduces the complexity of pre-trained DiTs to linear. We begin our exploration with a comprehensive summary of existing efficient attention mechanisms and identify four key factors crucial for successful linearization of pre-trained DiTs: locality, formulation consistency, high-rank attention maps, and feature integrity. Based on these insights, we introduce a convolution-like local attention strategy termed CLEAR, which limits feature interactions to a local window around each query token and thus achieves linear complexity. Our experiments indicate that, by fine-tuning the attention layers on merely 10K self-generated samples for 10K iterations, we can effectively transfer knowledge from a pre-trained DiT to a student model with linear complexity, yielding results comparable to the teacher model. At the same time, the student reduces attention computation by 99.5% and accelerates generation by 6.3 times for 8K-resolution images. Furthermore, we investigate favorable properties of the distilled attention layers, such as zero-shot generalization across various models and plugins, and improved support for multi-GPU parallel inference. Models and code are available here: https://github.com/Huage001/CLEAR.
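
The local-window idea in the abstract can be illustrated with a small sketch. This is a minimal pure-Python illustration, not the authors' implementation (that lives in the linked repository). The function names, the circular window shape, and the `radius` parameter are all assumptions made here for illustration. Each query on an H x W token grid attends only to keys within a fixed distance, so the number of query-key pairs is bounded by the window size times the number of tokens, rather than growing with the square of the number of tokens.

```python
import math


def window_offsets(radius):
    """Return (dy, dx) offsets whose Euclidean distance from the center is < radius.

    The circular window is an assumption for this sketch.
    """
    r = int(math.ceil(radius))
    return [(dy, dx)
            for dy in range(-r, r + 1)
            for dx in range(-r, r + 1)
            if dy * dy + dx * dx < radius * radius]


def local_attention(q, k, v, height, width, radius):
    """Softmax attention where each query on a grid only sees nearby keys.

    q, k, v: lists of length height * width (row-major grid order);
    each entry is a list of floats.
    """
    offsets = window_offsets(radius)
    scale = 1.0 / math.sqrt(len(q[0]))
    dim_v = len(v[0])
    out = []
    for y in range(height):
        for x in range(width):
            qi = q[y * width + x]
            # Indices of keys that fall inside the window and inside the grid.
            idx = [(y + dy) * width + (x + dx)
                   for dy, dx in offsets
                   if 0 <= y + dy < height and 0 <= x + dx < width]
            scores = [scale * sum(a * b for a, b in zip(qi, k[j])) for j in idx]
            m = max(scores)  # subtract the max for numerical stability
            weights = [math.exp(s - m) for s in scores]
            total = sum(weights)
            out.append([sum(w * v[j][d] for w, j in zip(weights, idx)) / total
                        for d in range(dim_v)])
    return out


def attended_pairs(height, width, radius):
    """Count the query-key pairs evaluated by the local scheme."""
    offsets = window_offsets(radius)
    return sum(1
               for y in range(height) for x in range(width)
               for dy, dx in offsets
               if 0 <= y + dy < height and 0 <= x + dx < width)
```

With a fixed `radius`, `attended_pairs(height, width, radius)` is at most `len(window_offsets(radius)) * height * width`, which is linear in the number of tokens, whereas full attention evaluates `(height * width) ** 2` pairs. A window large enough to cover the whole grid reduces to ordinary full attention.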
