Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders

Abstract

Sparse autoencoders (SAEs) have become a core ingredient in the reverse engineering of large language models (LLMs). For LLMs, they have been shown to decompose intermediate representations, which are often not directly interpretable, into sparse sums of interpretable features, facilitating better control and subsequent analysis. However, similar analyses and approaches have been lacking for text-to-image models. We investigate the possibility of using SAEs to learn interpretable features for few-step text-to-image diffusion models, such as SDXL Turbo. To this end, we train SAEs on the updates performed by transformer blocks within SDXL Turbo's denoising U-net. We find that their learned features are interpretable, causally influence the generation process, and reveal specialization among the blocks. In particular, we find one block that deals mainly with image composition, one that is mainly responsible for adding local details, and one for color, illumination, and style. Therefore, our work is an important first step towards better understanding the internals of generative text-to-image models like SDXL Turbo and showcases the potential of features learned by SAEs for the visual domain. Code is available at https://github.com/surkovv/sdxl-unbox.
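To make the idea of "sparse sums of interpretable features" concrete, the following is a minimal sketch in plain Python, not the authors' implementation (their code is in the linked repository). It assumes a top-k sparsity rule, tied encoder and decoder weights, and a tiny hand-written dictionary instead of a trained one; the abstract specifies none of these. The names `encode`, `decode`, and `ablate` are hypothetical. The sketch shows a block update being encoded into a few active features, reconstructed from them, and then edited by removing one feature, which is the kind of intervention used to test whether a feature has a causal effect.

```python
# Toy sparse autoencoder over a "block update" vector.
# Assumptions (not from the paper): top-k sparsity, tied weights,
# and a hand-made dictionary rather than a trained one.

def encode(update, enc_weights, enc_bias, k):
    """Compute ReLU activations and keep only the k largest nonzero ones."""
    pre = [max(0.0, sum(w * x for w, x in zip(row, update)) + b)
           for row, b in zip(enc_weights, enc_bias)]
    keep = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k]
    return {i: pre[i] for i in keep if pre[i] > 0.0}

def decode(codes, dec_weights, dim):
    """Reconstruct the update as a sparse sum of feature directions."""
    out = [0.0] * dim
    for i, activation in codes.items():
        for j in range(dim):
            out[j] += activation * dec_weights[i][j]
    return out

def ablate(update, codes, dec_weights, feature):
    """Subtract one feature's contribution from the update."""
    activation = codes.get(feature, 0.0)
    return [u - activation * d for u, d in zip(update, dec_weights[feature])]

# Hypothetical dictionary: four features, each aligned with one axis.
dictionary = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
bias = [0.0, 0.0, 0.0, 0.0]

update = [2.0, 0.0, 0.5, 0.0]                      # stand-in for a block update
codes = encode(update, dictionary, bias, k=2)      # sparse feature activations
recon = decode(codes, dictionary, dim=4)           # sparse-sum reconstruction
ablated = ablate(update, codes, dictionary, feature=0)
```

In a real setting the dictionary would be learned by minimizing reconstruction error on many recorded block updates, and the edited update would be written back into the U-net's forward pass to observe the effect on the generated image.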
