Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models
April 3, 2025
Authors: Mateusz Pach, Shyamgopal Karthik, Quentin Bouniot, Serge Belongie, Zeynep Akata
cs.AI
Abstract
Sparse Autoencoders (SAEs) have recently been shown to enhance
interpretability and steerability in Large Language Models (LLMs). In this
work, we extend the application of SAEs to Vision-Language Models (VLMs), such
as CLIP, and introduce a comprehensive framework for evaluating monosemanticity
in vision representations. Our experimental results reveal that SAEs trained on
VLMs significantly enhance the monosemanticity of individual neurons while also
exhibiting hierarchical representations that align well with expert-defined
structures (e.g., iNaturalist taxonomy). Most notably, we demonstrate that
applying SAEs to intervene on a CLIP vision encoder directly steers the output
of multimodal LLMs (e.g., LLaVA) without any modifications to the underlying
model. These findings emphasize the practicality and efficacy of SAEs as an
unsupervised approach for enhancing both the interpretability and control of
VLMs.
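
The abstract does not include code, but the general recipe it describes can be sketched briefly. The snippet below is a minimal, illustrative sketch rather than the authors' implementation: it trains an overcomplete sparse autoencoder on CLIP vision-encoder activations with an L1 sparsity penalty, and includes a hypothetical `steer_activations` helper that boosts one learned feature before the activation is passed on to a multimodal LLM such as LLaVA. All module names, hyperparameters, and the choice of hooked CLIP layer are assumptions made for illustration.

```python
# Illustrative sketch only; names and hyperparameters are assumptions,
# not the paper's released code or API.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder with a ReLU latent and an L1 sparsity penalty."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        z = F.relu(self.encoder(x))  # sparse latent code
        x_hat = self.decoder(z)      # reconstruction of the CLIP activation
        return x_hat, z


def sae_loss(x, x_hat, z, l1_coeff: float = 1e-3) -> torch.Tensor:
    # Reconstruction error plus an L1 penalty that pushes latent units toward
    # sparse, (ideally) monosemantic features.
    return F.mse_loss(x_hat, x) + l1_coeff * z.abs().mean()


@torch.no_grad()
def steer_activations(sae: SparseAutoencoder, x: torch.Tensor,
                      feature_idx: int, strength: float) -> torch.Tensor:
    # Hypothetical intervention: encode, set one latent feature to a chosen
    # strength, decode, and use the result in place of the original activation
    # fed to the multimodal LLM (e.g., LLaVA); the underlying models stay frozen.
    _, z = sae(x)
    z[..., feature_idx] = strength
    return sae.decoder(z)


if __name__ == "__main__":
    # Toy usage with random stand-ins for CLIP activations; a real run would
    # collect activations from a chosen CLIP vision-encoder layer over a large
    # image corpus and train the SAE on those.
    sae = SparseAutoencoder(d_model=768, d_hidden=8 * 768)
    x = torch.randn(32, 768)
    x_hat, z = sae(x)
    sae_loss(x, x_hat, z).backward()
    steered = steer_activations(sae, x, feature_idx=0, strength=5.0)
    print(steered.shape)
```

Under these assumptions, the SAE would be trained offline on cached activations from a fixed CLIP layer and then attached as a lightweight hook at inference time, so interpretation and steering require no changes to CLIP or the downstream LLM.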