Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders

October 28, 2024
作者: Viacheslav Surkov, Chris Wendler, Mikhail Terekhov, Justin Deschenaux, Robert West, Caglar Gulcehre
cs.AI

Abstract

Sparse autoencoders (SAEs) have become a core ingredient in the reverse engineering of large language models (LLMs). For LLMs, they have been shown to decompose intermediate representations that are often not directly interpretable into sparse sums of interpretable features, facilitating better control and subsequent analysis. However, similar analyses and approaches have been lacking for text-to-image models. We investigate the possibility of using SAEs to learn interpretable features for few-step text-to-image diffusion models, such as SDXL Turbo. To this end, we train SAEs on the updates performed by transformer blocks within SDXL Turbo's denoising U-net. We find that their learned features are interpretable, causally influence the generation process, and reveal specialization among the blocks. In particular, we find one block that deals mainly with image composition, one that is mainly responsible for adding local details, and one for color, illumination, and style. Therefore, our work is an important first step towards better understanding the internals of generative text-to-image models like SDXL Turbo, and it showcases the potential of features learned by SAEs for the visual domain. Code is available at https://github.com/surkovv/sdxl-unbox
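To make the setup concrete, the sketch below shows the standard SAE forward pass and training objective (reconstruction error plus an L1 sparsity penalty). This is a minimal illustration of the general technique, not the authors' implementation; the dimensions, initialization, and the stand-in activation vector are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_feat = 16, 64  # illustrative sizes; real SAEs are much wider than the model dimension

# Randomly initialized SAE parameters (encoder weights/bias, decoder weights/bias)
W_enc = rng.normal(0.0, 0.1, (d_feat, d_model))
b_enc = np.zeros(d_feat)
W_dec = rng.normal(0.0, 0.1, (d_model, d_feat))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode an activation into sparse feature codes, then reconstruct it."""
    z = np.maximum(0.0, W_enc @ (x - b_dec) + b_enc)  # ReLU yields non-negative, sparse codes
    x_hat = W_dec @ z + b_dec                         # reconstruction as a sum of decoder features
    return z, x_hat

def sae_loss(x, lam=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparse codes."""
    z, x_hat = sae_forward(x)
    return np.sum((x - x_hat) ** 2) + lam * np.sum(np.abs(z))

# Here 'x' stands in for one transformer-block update inside the denoising U-net;
# in the paper's setting, SAEs are trained on many such updates collected during generation.
x = rng.normal(size=d_model)
z, x_hat = sae_forward(x)
print(z.shape, x_hat.shape)  # → (64,) (16,)
```

Each column of `W_dec` acts as one learned feature direction, so a reconstruction is literally a sparse sum of interpretable features, which is what enables the per-block interpretation and causal interventions described above.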
