Enhancing Automated Interpretability with Output-Centric Feature Descriptions

January 14, 2025
Authors: Yoav Gur-Arieh, Roy Mayan, Chen Agassy, Atticus Geiger, Mor Geva
cs.AI

Abstract

Automated interpretability pipelines generate natural language descriptions for the concepts represented by features in large language models (LLMs), such as plants or the first word in a sentence. These descriptions are derived using inputs that activate the feature, which may be a dimension or a direction in the model's representation space. However, identifying activating inputs is costly, and the mechanistic role of a feature in model behavior is determined both by how inputs cause a feature to activate and by how feature activation affects outputs. Using steering evaluations, we reveal that current pipelines provide descriptions that fail to capture the causal effect of the feature on outputs. To fix this, we propose efficient, output-centric methods for automatically generating feature descriptions. These methods use the tokens weighted higher after feature stimulation or the highest weight tokens after applying the vocabulary "unembedding" head directly to the feature. Our output-centric descriptions better capture the causal effect of a feature on model outputs than input-centric descriptions, but combining the two leads to the best performance on both input and output evaluations. Lastly, we show that output-centric descriptions can be used to find inputs that activate features previously thought to be "dead".
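One of the output-centric methods described above projects a feature direction through the model's vocabulary "unembedding" head and reads off the highest-weight tokens as the seed for a description. A minimal sketch of that idea, using toy random weights (the names `W_U`, `feature`, and `vocab` are illustrative assumptions, not the paper's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 8, 5
vocab = ["plant", "tree", "run", "the", "blue"]  # toy vocabulary

# Unembedding head: maps a d_model-dim representation to vocabulary logits.
W_U = rng.normal(size=(vocab_size, d_model))

# A feature, here taken to be a direction in the model's representation space.
feature = rng.normal(size=d_model)

# Apply the unembedding head directly to the feature and rank tokens.
logits = W_U @ feature
top_k = 3
top_tokens = [vocab[i] for i in np.argsort(logits)[::-1][:top_k]]

# The highest-weight tokens would then be passed to an LLM to draft
# an output-centric description of the feature.
print(top_tokens)
```

In the actual pipeline these top tokens are combined with input-centric evidence (maximally activating examples), since the abstract reports that the combination performs best on both input and output evaluations.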
