LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations
October 3, 2024
Authors: Hadas Orgad, Michael Toker, Zorik Gekhman, Roi Reichart, Idan Szpektor, Hadas Kotek, Yonatan Belinkov
cs.AI
Abstract
Large language models (LLMs) often produce errors, including factual
inaccuracies, biases, and reasoning failures, collectively referred to as
"hallucinations". Recent studies have demonstrated that LLMs' internal states
encode information regarding the truthfulness of their outputs, and that this
information can be utilized to detect errors. In this work, we show that the
internal representations of LLMs encode much more information about
truthfulness than previously recognized. We first discover that the
truthfulness information is concentrated in specific tokens, and leveraging
this property significantly enhances error detection performance. Yet, we show
that such error detectors fail to generalize across datasets, implying that --
contrary to prior claims -- truthfulness encoding is not universal but rather
multifaceted. Next, we show that internal representations can also be used for
predicting the types of errors the model is likely to make, facilitating the
development of tailored mitigation strategies. Lastly, we reveal a discrepancy
between LLMs' internal encoding and external behavior: they may encode the
correct answer, yet consistently generate an incorrect one. Taken together,
these insights deepen our understanding of LLM errors from the model's internal
perspective, which can guide future research on enhancing error analysis and
mitigation.
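To make the error-detection idea concrete, here is a minimal sketch of a "probing" classifier trained on a model's internal states. The model name, probed layer, the choice of the answer's final token as the "specific token", and the data-loading helper are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch: probe an LLM's hidden states for truthfulness signals.
# Assumptions (not from the paper): model name, probed layer, and using the
# answer's last token as the representative "specific token".
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"  # illustrative choice
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, output_hidden_states=True
).to(device)
model.eval()


def answer_token_state(prompt: str, answer: str, layer: int = 16) -> torch.Tensor:
    """Hidden state of the answer's last token at a chosen intermediate layer."""
    inputs = tokenizer(prompt + answer, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.hidden_states is a tuple of (num_layers + 1) tensors [batch, seq, dim]
    return outputs.hidden_states[layer][0, -1].float().cpu()


# Training data: (prompt, generated_answer) pairs labeled correct/incorrect by
# comparing generations against gold answers (hypothetical helper below).
# pairs, labels = load_annotated_generations()
# X = torch.stack([answer_token_state(p, a) for p, a in pairs]).numpy()
# probe = LogisticRegression(max_iter=1000).fit(X, labels)
# probe.predict_proba(X_new)[:, 1] scores how likely each new answer is correct.
```

As the abstract cautions, a probe of this kind trained on one dataset may not transfer to another, since the truthfulness encoding appears to be multifaceted rather than universal.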