Enhancing Automated Interpretability with Output-Centric Feature Descriptions

Abstract

Automated interpretability pipelines generate natural language descriptions for the concepts represented by features in large language models (LLMs), such as plants or the first word in a sentence. These descriptions are derived using inputs that activate the feature, which may be a dimension or a direction in the model's representation space. However, identifying activating inputs is costly, and the mechanistic role of a feature in model behavior is determined both by how inputs cause a feature to activate and by how feature activation affects outputs. Using steering evaluations, we reveal that current pipelines provide descriptions that fail to capture the causal effect of the feature on outputs. To fix this, we propose efficient, output-centric methods for automatically generating feature descriptions. These methods use either the tokens weighted higher after feature stimulation or the highest-weight tokens after applying the vocabulary "unembedding" head directly to the feature. Our output-centric descriptions capture the causal effect of a feature on model outputs better than input-centric descriptions, but combining the two leads to the best performance on both input and output evaluations. Lastly, we show that output-centric descriptions can be used to find inputs that activate features previously thought to be "dead".
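The two output-centric signals named in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the function names, the five-token vocabulary, the 3-dimensional representation space, and the steering strength `alpha` are all hypothetical, and a real transformer would typically apply a final normalization layer before the unembedding head, which is omitted here. The first function applies an unembedding matrix directly to a feature direction and keeps the highest-scoring tokens; the second adds the feature to a hidden state and reports the tokens whose output probability rises the most.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

def top_tokens_by_unembedding(feature, unembed, vocab, k=3):
    """Project a feature direction through the unembedding matrix
    and return the k tokens with the highest resulting logits."""
    logits = unembed @ feature              # shape: (vocab_size,)
    top = np.argsort(-logits)[:k]           # indices sorted by descending logit
    return [(vocab[i], float(logits[i])) for i in top]

def tokens_promoted_by_steering(hidden, feature, unembed, vocab, alpha=2.0, k=3):
    """Add alpha * feature to a hidden state and return the k tokens
    whose output probability increases the most (assumed toy setup)."""
    base = softmax(unembed @ hidden)
    steered = softmax(unembed @ (hidden + alpha * feature))
    delta = steered - base
    top = np.argsort(-delta)[:k]
    return [(vocab[i], float(delta[i])) for i in top]

# Hypothetical toy data: 5 tokens, 3-dimensional representation space.
vocab = ["tree", "leaf", "car", "road", "the"]
unembed = np.array([
    [ 2.0, 0.0, 0.0],   # tree
    [ 1.5, 0.1, 0.0],   # leaf
    [-1.0, 2.0, 0.0],   # car
    [-0.5, 1.5, 0.0],   # road
    [ 0.0, 0.0, 1.0],   # the
])
feature = np.array([1.0, 0.0, 0.0])   # a made-up "plants" direction
hidden = np.array([0.0, 1.0, 0.0])    # a made-up hidden state favoring vehicles

print(top_tokens_by_unembedding(feature, unembed, vocab, k=2))
print(tokens_promoted_by_steering(hidden, feature, unembed, vocab, k=2))
```

In this toy setup both signals surface the plant-related tokens, which a description generator could then summarize in natural language. Neither function needs a search over a corpus for activating inputs, which is the efficiency argument the abstract makes for output-centric methods.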
