Sparse Autoencoders for Scientifically Rigorous Interpretation of Vision Models
February 10, 2025
Authors: Samuel Stevens, Wei-Lun Chao, Tanya Berger-Wolf, Yu Su
cs.AI
Abstract
To truly understand vision models, we must not only interpret their learned
features but also validate these interpretations through controlled
experiments. Current approaches either provide interpretable features without
the ability to test their causal influence, or enable model editing without
interpretable controls. We present a unified framework using sparse
autoencoders (SAEs) that bridges this gap, allowing us to discover
human-interpretable visual features and precisely manipulate them to test
hypotheses about model behavior. By applying our method to state-of-the-art
vision models, we reveal key differences in the semantic abstractions learned
by models with different pre-training objectives. We then demonstrate the
practical usage of our framework through controlled interventions across
multiple vision tasks. We show that SAEs can reliably identify and manipulate
interpretable visual features without model re-training, providing a powerful
tool for understanding and controlling vision model behavior. We provide code,
demos, and models on our project website: https://osu-nlp-group.github.io/SAE-V.
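To make the idea concrete, the sketch below shows a minimal sparse autoencoder over a vision model's activations together with a feature-level intervention. This is an illustrative assumption of the general SAE recipe, not the paper's released implementation; names such as `SparseAutoencoder`, `d_model`, `d_hidden`, and `feature_idx` are hypothetical.

```python
# Minimal sketch of a sparse autoencoder on vision-model activations,
# plus an intervention that edits one latent feature. Hyperparameters and
# names are illustrative, not taken from the paper's code release.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU keeps latent codes non-negative; during training an L1
        # penalty on this code encourages sparse, interpretable features.
        return torch.relu(self.encoder(x))

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        return self.decoder(z)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decode(self.encode(x))

def intervene(sae: SparseAutoencoder, x: torch.Tensor,
              feature_idx: int, scale: float = 0.0) -> torch.Tensor:
    """Suppress (scale=0) or amplify (scale>1) one latent feature,
    then map the edited code back to the model's activation space."""
    z = sae.encode(x)
    z[..., feature_idx] = z[..., feature_idx] * scale
    return sae.decode(z)

# Usage: x stands in for a batch of frozen vision-model activations
# (shape [batch, d_model]); the edited activations would be patched back
# into the model to test the causal effect of the chosen feature.
sae = SparseAutoencoder(d_model=768, d_hidden=16384)
x = torch.randn(4, 768)
edited = intervene(sae, x, feature_idx=123, scale=0.0)
```

Because the intervention only rewrites activations through the trained SAE, the underlying vision model itself stays frozen, which is what allows hypotheses about individual features to be tested without any re-training.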