
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models

November 21, 2024
Authors: Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan, Neel Nanda
cs.AI

Abstract

Hallucinations in large language models are a widespread problem, yet the mechanisms behind whether models will hallucinate are poorly understood, limiting our ability to solve this problem. Using sparse autoencoders as an interpretability tool, we discover that a key part of these mechanisms is entity recognition, where the model detects whether an entity is one it can recall facts about. Sparse autoencoders uncover meaningful directions in the representation space that detect whether the model recognizes an entity, e.g., detecting that it does not know about an athlete or a movie. This suggests that models can have self-knowledge: internal representations about their own capabilities. These directions are causally relevant: they can steer the model to refuse to answer questions about known entities, or to hallucinate attributes of unknown entities when it would otherwise refuse. We demonstrate that, despite the sparse autoencoders being trained on the base model, these directions have a causal effect on the chat model's refusal behavior, suggesting that chat finetuning has repurposed this existing mechanism. Furthermore, we provide an initial exploration of the mechanistic role of these directions in the model, finding that they disrupt the attention of downstream heads that typically move entity attributes to the final token.
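To make the steering intervention concrete, below is a minimal sketch of adding an SAE-derived "unknown entity" direction to the residual stream during generation. The model name, layer index, steering coefficient, hook point, and the placeholder direction are illustrative assumptions, not the paper's exact setup; in practice the direction would come from the decoder weights of a trained sparse autoencoder.

```python
# Minimal sketch of activation steering with an SAE-derived direction.
# Assumptions (not from the paper): model choice, hook layer, steering
# coefficient, and a random placeholder for the "unknown entity" direction.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "google/gemma-2b"  # illustrative choice of base model
LAYER = 9                       # hypothetical layer to hook
COEFF = 8.0                     # hypothetical steering strength

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

d_model = model.config.hidden_size
# Placeholder for the SAE decoder column of the "unknown entity" latent;
# a real experiment would load this vector from a trained sparse autoencoder.
unknown_entity_dir = torch.randn(d_model)
unknown_entity_dir = unknown_entity_dir / unknown_entity_dir.norm()

def steering_hook(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states;
    # add the (scaled) direction to every token position's residual stream.
    hidden = output[0]
    direction = unknown_entity_dir.to(hidden.dtype).to(hidden.device)
    hidden = hidden + COEFF * direction
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
try:
    prompt = "Fact: the basketball player Michael Jordan plays the sport of"
    ids = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=10, do_sample=False)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()
```

Steering toward the "unknown entity" direction on a known-entity prompt would, per the paper's findings, be expected to push the (chat) model toward refusal; subtracting it on unknown entities pushes toward hallucinated attributes.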
