Enhancing Automated Interpretability with Output-Centric Feature Descriptions
January 14, 2025
Authors: Yoav Gur-Arieh, Roy Mayan, Chen Agassy, Atticus Geiger, Mor Geva
cs.AI
Abstract
Automated interpretability pipelines generate natural language descriptions for the concepts represented by features in large language models (LLMs), such as plants or the first word in a sentence. These descriptions are derived using inputs that activate the feature, which may be a dimension or a direction in the model's representation space. However, identifying activating inputs is costly, and the mechanistic role of a feature in model behavior is determined both by how inputs cause a feature to activate and by how feature activation affects outputs. Using steering evaluations, we reveal that current pipelines provide descriptions that fail to capture the causal effect of the feature on outputs. To fix this, we propose efficient, output-centric methods for automatically generating feature descriptions. These methods use the tokens weighted higher after feature stimulation or the highest weight tokens after applying the vocabulary "unembedding" head directly to the feature. Our output-centric descriptions better capture the causal effect of a feature on model outputs than input-centric descriptions, but combining the two leads to the best performance on both input and output evaluations. Lastly, we show that output-centric descriptions can be used to find inputs that activate features previously thought to be "dead".
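One of the two output-centric ideas, projecting a feature direction through the model's vocabulary "unembedding" head to read off its highest-weight tokens, can be illustrated in a few lines. The sketch below is an assumption-laden illustration, not the authors' code: the model choice (gpt2), the random feature_vec, and k=10 are all placeholders.

```python
# Minimal sketch of vocabulary projection: map a feature direction through the
# unembedding head and inspect the tokens it promotes most strongly.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for any causal LM with an unembedding head
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical feature: a direction in the residual stream, shape (d_model,).
# In practice this would come from an SAE latent, a neuron, or a probe.
feature_vec = torch.randn(model.config.hidden_size)

# Unembedding head weight: maps residual-stream vectors to vocabulary logits.
W_U = model.get_output_embeddings().weight  # shape (vocab_size, d_model)

# Tokens weighted highest when the feature is projected onto the vocabulary.
logits = W_U @ feature_vec                  # shape (vocab_size,)
top = torch.topk(logits, k=10)
print([tokenizer.decode([i]) for i in top.indices.tolist()])
# These top tokens are candidate raw material for an output-centric description.
```

This projection requires no forward passes over activating inputs, which is why the paper frames such methods as efficient alternatives to input-centric pipelines.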
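The steering evaluations mentioned in the abstract stimulate a feature during a forward pass and check whether the generated text shifts toward the described concept. A rough sketch of one such intervention, assuming a GPT-2-style residual stream; the layer index, steering scale, and prompt are illustrative assumptions rather than the paper's exact protocol:

```python
# Minimal steering sketch: add a scaled feature direction to the residual
# stream at one layer and compare generations with and without it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer_idx, scale = 6, 8.0  # illustrative choices, not tuned values
feature_vec = torch.randn(model.config.hidden_size)
feature_vec = feature_vec / feature_vec.norm()

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states;
    # shift them along the feature direction and pass the rest through.
    hidden = output[0] + scale * feature_vec.to(output[0].dtype)
    return (hidden,) + output[1:]

ids = tokenizer("The weather today is", return_tensors="pt").input_ids

handle = model.transformer.h[layer_idx].register_forward_hook(steer)
steered = model.generate(ids, max_new_tokens=20, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
handle.remove()
baseline = model.generate(ids, max_new_tokens=20, do_sample=False,
                          pad_token_id=tokenizer.eos_token_id)

print("steered: ", tokenizer.decode(steered[0]))
print("baseline:", tokenizer.decode(baseline[0]))
```

Comparing the steered continuation against the baseline (e.g., via a judge model scoring whether the shift matches a candidate description) is the kind of output-side evaluation on which the paper reports input-centric descriptions falling short.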