Looking Inward: Language Models Can Learn About Themselves by Introspection

October 17, 2024
作者: Felix J Binder, James Chua, Tomek Korbak, Henry Sleight, John Hughes, Robert Long, Ethan Perez, Miles Turpin, Owain Evans
cs.AI

Abstract

Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings) that is not accessible to external observers. Can LLMs introspect? We define introspection as acquiring knowledge that is not contained in or derived from training data but instead originates from internal states. Such a capability could enhance model interpretability. Instead of painstakingly analyzing a model's internal workings, we could simply ask the model about its beliefs, world models, and goals. More speculatively, an introspective model might self-report on whether it possesses certain internal states such as subjective feelings or desires and this could inform us about the moral status of these states. Such self-reports would not be entirely dictated by the model's training data. We study introspection by finetuning LLMs to predict properties of their own behavior in hypothetical scenarios. For example, "Given the input P, would your output favor the short- or long-term option?" If a model M1 can introspect, it should outperform a different model M2 in predicting M1's behavior even if M2 is trained on M1's ground-truth behavior. The idea is that M1 has privileged access to its own behavioral tendencies, and this enables it to predict itself better than M2 (even if M2 is generally stronger). In experiments with GPT-4, GPT-4o, and Llama-3 models (each finetuned to predict itself), we find that the model M1 outperforms M2 in predicting itself, providing evidence for introspection. Notably, M1 continues to predict its behavior accurately even after we intentionally modify its ground-truth behavior. However, while we successfully elicit introspection on simple tasks, we are unsuccessful on more complex tasks or those requiring out-of-distribution generalization.
