Enhancing Automated Interpretability with Output-Centric Feature Descriptions

Abstract

Automated interpretability pipelines generate natural language descriptions for the concepts represented by features in large language models (LLMs), such as plants or the first word in a sentence. These descriptions are derived using inputs that activate the feature, which may be a dimension or a direction in the model's representation space. However, identifying activating inputs is costly, and the mechanistic role of a feature in model behavior is determined both by how inputs cause the feature to activate and by how feature activation affects outputs. Using steering evaluations, we reveal that current pipelines provide descriptions that fail to capture the causal effect of the feature on outputs. To address this, we propose efficient, output-centric methods for automatically generating feature descriptions. These methods use either the tokens that gain the most weight after the feature is stimulated or the highest-weight tokens obtained by applying the vocabulary "unembedding" head directly to the feature. Our output-centric descriptions capture the causal effect of a feature on model outputs better than input-centric descriptions do, but combining the two leads to the best performance on both input and output evaluations. Lastly, we show that output-centric descriptions can be used to find inputs that activate features previously thought to be "dead".
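To make the second output-centric idea concrete, the following is a minimal sketch of applying an unembedding matrix directly to a feature direction and reading off the highest-weight tokens. It is an illustration under stated assumptions, not the authors' implementation: the function name `top_unembedded_tokens`, the five-word vocabulary, and the 2-dimensional unembedding matrix are all hypothetical toy values, and a real model would also involve details such as a final normalization layer that are omitted here.

```python
def top_unembedded_tokens(feature, unembed, vocab, k=3):
    """Rank tokens by the logit weight a feature direction assigns to them.

    feature: list of floats, a direction in the hidden space (hypothetical).
    unembed: list of rows, one per hidden dimension, each of vocabulary length.
    vocab:   list of token strings aligned with the unembed columns.
    """
    # Logit weight for token j is the dot product of the feature
    # with column j of the unembedding matrix.
    logits = [
        sum(f * row[j] for f, row in zip(feature, unembed))
        for j in range(len(vocab))
    ]
    # Sort token indices from highest to lowest logit weight.
    ranked = sorted(range(len(vocab)), key=lambda j: logits[j], reverse=True)
    return [vocab[j] for j in ranked[:k]]


# Toy values, made up purely for illustration.
vocab = ["tree", "leaf", "car", "road", "flower"]
unembed = [
    [2.0, 1.5, -1.0, -0.5, 1.8],    # hidden dim 0: loosely "plant-like"
    [-0.5, -0.2, 2.0, 1.7, -0.3],   # hidden dim 1: loosely "vehicle-like"
]
feature = [1.0, 0.0]  # a feature aligned with hidden dim 0

print(top_unembedded_tokens(feature, unembed, vocab))
# prints ['tree', 'flower', 'leaf']
```

In a pipeline of the kind the abstract describes, a token list like this would then be turned into a natural language description (here, something like "plants"). The sketch stops at the token ranking because the abstract does not say how that final step is performed.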
