Sparse Autoencoders for Scientifically Rigorous Interpretation of Vision Models

Abstract

To truly understand vision models, we must not only interpret their learned features but also validate these interpretations through controlled experiments. Current approaches either provide interpretable features without the ability to test their causal influence, or enable model editing without interpretable controls. We present a unified framework using sparse autoencoders (SAEs) that bridges this gap, allowing us to discover human-interpretable visual features and precisely manipulate them to test hypotheses about model behavior. By applying our method to state-of-the-art vision models, we reveal key differences in the semantic abstractions learned by models with different pre-training objectives. We then demonstrate the practical use of our framework through controlled interventions across multiple vision tasks. We show that SAEs can reliably identify and manipulate interpretable visual features without model retraining, providing a powerful tool for understanding and controlling vision model behavior. We provide code, demos, and models on our project website: https://osu-nlp-group.github.io/SAE-V.
