Sparse Autoencoders for Scientifically Rigorous Interpretation of Vision Models
February 10, 2025
Authors: Samuel Stevens, Wei-Lun Chao, Tanya Berger-Wolf, Yu Su
cs.AI
Abstract
To truly understand vision models, we must not only interpret their learned
features but also validate these interpretations through controlled
experiments. Current approaches either provide interpretable features without
the ability to test their causal influence, or enable model editing without
interpretable controls. We present a unified framework using sparse
autoencoders (SAEs) that bridges this gap, allowing us to discover
human-interpretable visual features and precisely manipulate them to test
hypotheses about model behavior. By applying our method to state-of-the-art
vision models, we reveal key differences in the semantic abstractions learned
by models with different pre-training objectives. We then demonstrate the
practical usage of our framework through controlled interventions across
multiple vision tasks. We show that SAEs can reliably identify and manipulate
interpretable visual features without model re-training, providing a powerful
tool for understanding and controlling vision model behavior. We provide code,
demos, and models on our project website: https://osu-nlp-group.github.io/SAE-V.
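To make the idea concrete, the sketch below shows a minimal sparse autoencoder over a vision model's activations together with a feature-level intervention. This is an illustrative assumption of the general SAE recipe, not the paper's released implementation; names such as `SparseAutoencoder`, `d_model`, `d_hidden`, and `feature_idx` are hypothetical.

```python
# Minimal sketch of a sparse autoencoder on vision-model activations,
# plus an intervention that edits one latent feature. Hyperparameters and
# names are illustrative, not taken from the paper's code release.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU keeps latent codes non-negative; during training an L1
        # penalty on this code encourages sparse, interpretable features.
        return torch.relu(self.encoder(x))

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        return self.decoder(z)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decode(self.encode(x))

def intervene(sae: SparseAutoencoder, x: torch.Tensor,
              feature_idx: int, scale: float = 0.0) -> torch.Tensor:
    """Suppress (scale=0) or amplify (scale>1) one latent feature,
    then map the edited code back to the model's activation space."""
    z = sae.encode(x)
    z[..., feature_idx] = z[..., feature_idx] * scale
    return sae.decode(z)

# Usage: x stands in for a batch of frozen vision-model activations
# (shape [batch, d_model]); the edited activations would be patched back
# into the model to test the causal effect of the chosen feature.
sae = SparseAutoencoder(d_model=768, d_hidden=16384)
x = torch.randn(4, 768)
edited = intervene(sae, x, feature_idx=123, scale=0.0)
```

Because the intervention only rewrites activations through the trained SAE, the underlying vision model itself stays frozen, which is what allows hypotheses about individual features to be tested without any re-training.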