Looking Inward: Language Models Can Learn About Themselves by Introspection

Abstract

Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings) that is not accessible to external observers. Can LLMs introspect? We define introspection as acquiring knowledge that is not contained in or derived from training data but instead originates from internal states. Such a capability could enhance model interpretability. Instead of painstakingly analyzing a model's internal workings, we could simply ask the model about its beliefs, world models, and goals. More speculatively, an introspective model might self-report on whether it possesses certain internal states such as subjective feelings or desires, and this could inform us about the moral status of these states. Such self-reports would not be entirely dictated by the model's training data. We study introspection by finetuning LLMs to predict properties of their own behavior in hypothetical scenarios. For example, "Given the input P, would your output favor the short- or long-term option?" If a model M1 can introspect, it should outperform a different model M2 in predicting M1's behavior, even if M2 is trained on M1's ground-truth behavior. The idea is that M1 has privileged access to its own behavioral tendencies, and this enables it to predict itself better than M2 can (even if M2 is generally stronger). In experiments with GPT-4, GPT-4o, and Llama-3 models (each finetuned to predict itself), we find that the model M1 outperforms M2 in predicting itself, providing evidence for introspection. Notably, M1 continues to predict its behavior accurately even after we intentionally modify its ground-truth behavior. However, while we successfully elicit introspection on simple tasks, we are unsuccessful on more complex tasks or those requiring out-of-distribution generalization.
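The comparison at the heart of the abstract, self-prediction by M1 versus cross-prediction by M2, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the "short"/"long" labels, and the toy data below are all hypothetical and made up purely for illustration, and the sketch assumes each prediction is a single categorical label scored by exact match.

```python
from typing import Sequence


def prediction_accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of prompts where the predicted behavior property matches the actual one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one example")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


def self_prediction_advantage(
    m1_self: Sequence[str],
    m2_cross: Sequence[str],
    m1_actual: Sequence[str],
) -> float:
    """Accuracy of M1 predicting itself minus accuracy of M2 predicting M1.

    A positive value is the pattern the abstract treats as evidence
    for introspection; zero or negative values are not.
    """
    return prediction_accuracy(m1_self, m1_actual) - prediction_accuracy(m2_cross, m1_actual)


# Toy, made-up labels (NOT results from the paper).
# Each entry answers: "would your output favor the short- or long-term option?"
m1_actual = ["long", "short", "long", "long"]   # M1's ground-truth behavior
m1_self = ["long", "short", "long", "short"]    # M1's predictions about itself
m2_cross = ["long", "long", "short", "short"]   # M2's predictions about M1

advantage = self_prediction_advantage(m1_self, m2_cross, m1_actual)
print(f"self-prediction advantage: {advantage:.2f}")
```

In this toy setup M1 matches its own behavior on three of four prompts and M2 on one of four, so the advantage is positive. A real evaluation would need held-out prompts and a significance test rather than a raw difference on a handful of examples.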
