Enhancing Automated Interpretability with Output-Centric Feature Descriptions

Abstract

Automated interpretability pipelines generate natural language descriptions for the concepts represented by features in large language models (LLMs), such as plants or the first word in a sentence. These descriptions are derived from inputs that activate the feature, which may be a dimension or a direction in the model's representation space. However, identifying activating inputs is costly, and the mechanistic role of a feature in model behavior is determined both by how inputs cause the feature to activate and by how the feature's activation affects outputs. Using steering evaluations, we reveal that current pipelines produce descriptions that fail to capture the causal effect of the feature on outputs. To address this, we propose efficient, output-centric methods for automatically generating feature descriptions. These methods use either the tokens whose weights increase most after the feature is stimulated, or the highest-weight tokens obtained by applying the vocabulary "unembedding" head directly to the feature. Our output-centric descriptions capture the causal effect of a feature on model outputs better than input-centric descriptions do, but combining the two approaches yields the best performance on both input and output evaluations. Lastly, we show that output-centric descriptions can be used to find inputs that activate features previously thought to be "dead".
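As a rough illustration of the two output-centric ideas mentioned above, the following toy sketch ranks tokens (a) by projecting a feature direction through an unembedding matrix and (b) by how much their logits rise after the feature is amplified. It is only a sketch under stated assumptions: the function names, the five-word vocabulary, the matrix, and the logit values are all hypothetical, and the "steered" logits are hard-coded rather than produced by a real model.

```python
# Toy sketch: every name, shape, and number below is an illustrative assumption,
# not the authors' implementation.

def top_tokens_by_projection(feature, unembedding, vocab, k=3):
    """Score each token by the dot product of its unembedding row with the
    feature direction, then return the k highest-scoring tokens."""
    scores = [sum(w * f for w, f in zip(row, feature)) for row in unembedding]
    ranked = sorted(range(len(vocab)), key=lambda i: scores[i], reverse=True)
    return [vocab[i] for i in ranked[:k]]


def top_tokens_by_logit_change(base_logits, steered_logits, vocab, k=3):
    """Return the k tokens whose logits increase the most once the feature
    has been amplified (steered) relative to the unmodified run."""
    deltas = [s - b for b, s in zip(base_logits, steered_logits)]
    ranked = sorted(range(len(vocab)), key=lambda i: deltas[i], reverse=True)
    return [vocab[i] for i in ranked[:k]]


vocab = ["tree", "leaf", "car", "the", "flower"]

# Hypothetical 5 x 3 unembedding matrix: one row per vocabulary token.
unembedding = [
    [0.9, 0.1, 0.0],
    [0.8, 0.0, 0.1],
    [-0.5, 0.7, 0.2],
    [0.0, 0.1, 0.0],
    [0.7, -0.1, 0.3],
]
feature = [1.0, 0.0, 0.0]  # hypothetical "plants" direction

projection_tokens = top_tokens_by_projection(feature, unembedding, vocab)

# Hard-coded stand-ins for next-token logits without and with steering.
base_logits = [1.0, 0.5, 2.0, 3.0, 0.2]
steered_logits = [2.5, 1.8, 1.0, 3.0, 1.6]
change_tokens = top_tokens_by_logit_change(base_logits, steered_logits, vocab)
```

In a real pipeline, the resulting token lists would presumably be handed to an explainer model that writes the natural language description; that step is omitted here.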
