LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations
October 3, 2024
Authors: Hadas Orgad, Michael Toker, Zorik Gekhman, Roi Reichart, Idan Szpektor, Hadas Kotek, Yonatan Belinkov
cs.AI
Abstract
Large language models (LLMs) often produce errors, including factual
inaccuracies, biases, and reasoning failures, collectively referred to as
"hallucinations". Recent studies have demonstrated that LLMs' internal states
encode information regarding the truthfulness of their outputs, and that this
information can be utilized to detect errors. In this work, we show that the
internal representations of LLMs encode much more information about
truthfulness than previously recognized. We first discover that the
truthfulness information is concentrated in specific tokens, and leveraging
this property significantly enhances error detection performance. Yet, we show
that such error detectors fail to generalize across datasets, implying that --
contrary to prior claims -- truthfulness encoding is not universal but rather
multifaceted. Next, we show that internal representations can also be used for
predicting the types of errors the model is likely to make, facilitating the
development of tailored mitigation strategies. Lastly, we reveal a discrepancy
between LLMs' internal encoding and external behavior: they may encode the
correct answer, yet consistently generate an incorrect one. Taken together,
these insights deepen our understanding of LLM errors from the model's internal
perspective, which can guide future research on enhancing error analysis and
mitigation.
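To make the error-detection idea concrete, here is a minimal sketch of a "probing" classifier trained on a model's internal states. The model name, probed layer, the choice of the answer's final token as the "specific token", and the data-loading helper are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch: probe an LLM's hidden states for truthfulness signals.
# Assumptions (not from the paper): model name, probed layer, and using the
# answer's last token as the representative "specific token".
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"  # illustrative choice
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, output_hidden_states=True
).to(device)
model.eval()


def answer_token_state(prompt: str, answer: str, layer: int = 16) -> torch.Tensor:
    """Hidden state of the answer's last token at a chosen intermediate layer."""
    inputs = tokenizer(prompt + answer, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.hidden_states is a tuple of (num_layers + 1) tensors [batch, seq, dim]
    return outputs.hidden_states[layer][0, -1].float().cpu()


# Training data: (prompt, generated_answer) pairs labeled correct/incorrect by
# comparing generations against gold answers (hypothetical helper below).
# pairs, labels = load_annotated_generations()
# X = torch.stack([answer_token_state(p, a) for p, a in pairs]).numpy()
# probe = LogisticRegression(max_iter=1000).fit(X, labels)
# probe.predict_proba(X_new)[:, 1] scores how likely each new answer is correct.
```

As the abstract cautions, a probe of this kind trained on one dataset may not transfer to another, since the truthfulness encoding appears to be multifaceted rather than universal.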