Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
November 21, 2024
Authors: Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan, Neel Nanda
cs.AI
Abstract
Hallucinations in large language models are a widespread problem, yet the
mechanisms behind whether models will hallucinate are poorly understood,
limiting our ability to solve this problem. Using sparse autoencoders as an
interpretability tool, we discover that a key part of these mechanisms is
entity recognition, where the model detects if an entity is one it can recall
facts about. Sparse autoencoders uncover meaningful directions in the
representation space, these detect whether the model recognizes an entity, e.g.
detecting it doesn't know about an athlete or a movie. This suggests that
models can have self-knowledge: internal representations about their own
capabilities. These directions are causally relevant: capable of steering the
model to refuse to answer questions about known entities, or to hallucinate
attributes of unknown entities when it would otherwise refuse. We demonstrate
that despite the sparse autoencoders being trained on the base model, these
directions have a causal effect on the chat model's refusal behavior,
suggesting that chat finetuning has repurposed this existing mechanism.
Furthermore, we provide an initial exploration into the mechanistic role of
these directions in the model, finding that they disrupt the attention of
downstream heads that typically move entity attributes to the final token.
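To make the entity-recognition idea concrete, the sketch below shows how one might probe such a direction: encode the residual stream at an entity's final token with a sparse autoencoder and read off a single latent. This is not the authors' released code; the model name, layer index, latent index, and the `SparseAutoencoder` stand-in class are illustrative assumptions.

```python
# Minimal sketch (not the authors' released code): probing a hypothetical
# "unknown entity" SAE latent on a base model's residual stream.
# MODEL_NAME, LAYER, UNKNOWN_LATENT and the SparseAutoencoder stand-in are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "google/gemma-2b"   # assumed base model
LAYER = 9                        # assumed residual-stream layer the SAE reads from
UNKNOWN_LATENT = 12345           # hypothetical index of the "unknown entity" latent

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()


class SparseAutoencoder(torch.nn.Module):
    """Stand-in for a pretrained SAE; in practice its weights would be loaded."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = torch.nn.Parameter(torch.zeros(d_model, d_sae))
        self.b_enc = torch.nn.Parameter(torch.zeros(d_sae))

    def encode(self, resid: torch.Tensor) -> torch.Tensor:
        # ReLU encoder: sparse, non-negative latent activations.
        return torch.relu(resid @ self.W_enc + self.b_enc)


sae = SparseAutoencoder(d_model=model.config.hidden_size, d_sae=16384)


@torch.no_grad()
def unknown_entity_score(prompt: str, token_pos: int = -1) -> float:
    """Activation of the hypothetical 'unknown entity' latent at `token_pos`."""
    inputs = tokenizer(prompt, return_tensors="pt")
    hidden = model(**inputs).hidden_states[LAYER]   # (1, seq_len, d_model)
    latents = sae.encode(hidden[0, token_pos])      # (d_sae,)
    return latents[UNKNOWN_LATENT].item()


# A higher score would indicate the model does not recall facts about the entity.
print(unknown_entity_score("Fact: the movie 'Zyphor Rising'"))
```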
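The abstract's causal claim corresponds to activation steering: adding a scaled "known" or "unknown entity" direction to the residual stream and checking whether the model now refuses or hallucinates. Below is a hedged sketch of such an intervention using a standard forward hook; the layer, steering scale, and the randomly initialized direction are placeholders for the SAE decoder direction the paper identifies.

```python
# Minimal sketch (assumptions as above): steering generation by adding a scaled
# "unknown entity" direction to the residual stream at one decoder layer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "google/gemma-2b"   # assumed model
LAYER = 9                        # assumed layer to intervene at
SCALE = 10.0                     # assumed steering strength

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

d_model = model.config.hidden_size
# Placeholder: in practice this would be the SAE decoder row for the "unknown entity" latent.
unknown_direction = torch.randn(d_model)
unknown_direction = unknown_direction / unknown_direction.norm()


def steering_hook(module, inputs, output):
    # Decoder blocks return a tuple; the first element is the residual stream.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * unknown_direction.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden


handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
try:
    prompt = "What team does Michael Jordan play for?"
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=30, do_sample=False)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()
```

With a real SAE direction, pushing toward the "unknown entity" latent would be expected to increase refusals on known entities, and pushing the opposite way to elicit hallucinated attributes for unknown ones, mirroring the steering results the abstract describes.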