Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models
April 3, 2025
Authors: Mateusz Pach, Shyamgopal Karthik, Quentin Bouniot, Serge Belongie, Zeynep Akata
cs.AI
Abstract
Sparse Autoencoders (SAEs) have recently been shown to enhance
interpretability and steerability in Large Language Models (LLMs). In this
work, we extend the application of SAEs to Vision-Language Models (VLMs), such
as CLIP, and introduce a comprehensive framework for evaluating monosemanticity
in vision representations. Our experimental results reveal that SAEs trained on
VLMs significantly enhance the monosemanticity of individual neurons while also
exhibiting hierarchical representations that align well with expert-defined
structures (e.g., iNaturalist taxonomy). Most notably, we demonstrate that
applying SAEs to intervene on a CLIP vision encoder directly steers the output
of multimodal LLMs (e.g., LLaVA) without any modifications to the underlying
model. These findings emphasize the practicality and efficacy of SAEs as an
unsupervised approach for enhancing both the interpretability and control of
VLMs.
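
The abstract does not include code, but the general recipe it describes can be sketched briefly. The snippet below is a minimal, illustrative sketch rather than the authors' implementation: it trains an overcomplete sparse autoencoder on CLIP vision-encoder activations with an L1 sparsity penalty, and includes a hypothetical `steer_activations` helper that boosts one learned feature before the activation is passed on to a multimodal LLM such as LLaVA. All module names, hyperparameters, and the choice of hooked CLIP layer are assumptions made for illustration.

```python
# Illustrative sketch only; names and hyperparameters are assumptions,
# not the paper's released code or API.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder with a ReLU latent and an L1 sparsity penalty."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        z = F.relu(self.encoder(x))  # sparse latent code
        x_hat = self.decoder(z)      # reconstruction of the CLIP activation
        return x_hat, z


def sae_loss(x, x_hat, z, l1_coeff: float = 1e-3) -> torch.Tensor:
    # Reconstruction error plus an L1 penalty that pushes latent units toward
    # sparse, (ideally) monosemantic features.
    return F.mse_loss(x_hat, x) + l1_coeff * z.abs().mean()


@torch.no_grad()
def steer_activations(sae: SparseAutoencoder, x: torch.Tensor,
                      feature_idx: int, strength: float) -> torch.Tensor:
    # Hypothetical intervention: encode, set one latent feature to a chosen
    # strength, decode, and use the result in place of the original activation
    # fed to the multimodal LLM (e.g., LLaVA); the underlying models stay frozen.
    _, z = sae(x)
    z[..., feature_idx] = strength
    return sae.decoder(z)


if __name__ == "__main__":
    # Toy usage with random stand-ins for CLIP activations; a real run would
    # collect activations from a chosen CLIP vision-encoder layer over a large
    # image corpus and train the SAE on those.
    sae = SparseAutoencoder(d_model=768, d_hidden=8 * 768)
    x = torch.randn(32, 768)
    x_hat, z = sae(x)
    sae_loss(x, x_hat, z).backward()
    steered = steer_activations(sae, x, feature_idx=0, strength=5.0)
    print(steered.shape)
```

Under these assumptions, the SAE would be trained offline on cached activations from a fixed CLIP layer and then attached as a lightweight hook at inference time, so interpretation and steering require no changes to CLIP or the downstream LLM.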