Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders

Abstract

Sparse autoencoders (SAEs) have become a core ingredient in the reverse engineering of large language models (LLMs). For LLMs, they have been shown to decompose intermediate representations, which are often not directly interpretable, into sparse sums of interpretable features, facilitating better control and subsequent analysis. However, similar analyses and approaches have been lacking for text-to-image models. We investigate the possibility of using SAEs to learn interpretable features for few-step text-to-image diffusion models, such as SDXL Turbo. To this end, we train SAEs on the updates performed by transformer blocks within SDXL Turbo's denoising U-net. We find that the learned features are interpretable, causally influence the generation process, and reveal specialization among the blocks. In particular, we find one block that deals mainly with image composition, one that is mainly responsible for adding local details, and one for color, illumination, and style. Our work is therefore an important first step towards better understanding the internals of generative text-to-image models like SDXL Turbo, and it showcases the potential of features learned by SAEs for the visual domain. Code is available at https://github.com/surkovv/sdxl-unbox
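
To make the idea of decomposing a representation into a sparse sum of features concrete, the following is a minimal sketch of an SAE forward pass on a single vector, such as the update a transformer block writes at one spatial position. It is an illustration under stated assumptions, not the authors' implementation: the top-k sparsity rule, the toy dimensions, the random (untrained) weights, and the name `top_k_sae_forward` are all hypothetical choices made here, since the abstract does not specify the SAE variant.

```python
import random


def top_k_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Forward pass of a k-sparse autoencoder on one vector x.

    Assumed (hypothetical) design: subtract the decoder bias, apply a linear
    encoder, keep the k largest pre-activations clamped at zero, then decode.
    Returns (z, x_hat): sparse feature activations and the reconstruction.
    """
    n_features = len(W_enc)
    # Encoder: one pre-activation per dictionary feature.
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = [
        sum(w * c for w, c in zip(W_enc[j], centered)) + b_enc[j]
        for j in range(n_features)
    ]
    # Sparsity: keep the k largest pre-activations, zero out everything else.
    keep = sorted(range(n_features), key=lambda j: pre[j], reverse=True)[:k]
    z = [0.0] * n_features
    for j in keep:
        z[j] = max(pre[j], 0.0)
    # Decoder: the reconstruction is a sparse sum of feature directions W_dec[j].
    x_hat = list(b_dec)
    for j in keep:
        for i in range(len(x)):
            x_hat[i] += z[j] * W_dec[j][i]
    return z, x_hat


# Toy sizes and random weights, chosen only for illustration.
rng = random.Random(0)
d_model, n_features, k = 8, 32, 4
W_enc = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_features)]
W_dec = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_features)]
b_enc = [0.0] * n_features
b_dec = [0.0] * d_model

x = [rng.gauss(0, 1) for _ in range(d_model)]  # stand-in for one block update
z, x_hat = top_k_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
active = [j for j, v in enumerate(z) if v != 0.0]
print("active features:", active)
```

In a trained SAE, each active index would ideally correspond to one interpretable feature, and scaling or adding the matching row of `W_dec` to the block update is one simple way to probe whether that feature causally affects the generated image.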
