Analysing the Residual Stream of Language Models Under Knowledge Conflicts

Abstract

Large language models (LLMs) can store a significant amount of factual knowledge in their parameters. However, this parametric knowledge may conflict with information provided in the context. Such conflicts can lead to undesirable model behaviour, such as reliance on outdated or incorrect information. In this work, we investigate whether LLMs can identify knowledge conflicts and whether the source of knowledge the model will rely on can be determined by analysing its residual stream. Through probing tasks, we find that LLMs can internally register a signal of knowledge conflict in the residual stream, and that this signal can be accurately detected by probing the intermediate model activations. This allows conflicts to be detected before the answer is generated, without modifying the input or the model parameters. Moreover, we find that the residual stream shows significantly different patterns when the model relies on contextual knowledge versus parametric knowledge to resolve conflicts. These patterns can be used to estimate the behaviour of LLMs when a conflict occurs and to prevent unexpected answers before they are produced. Our analysis offers insights into how LLMs internally manage knowledge conflicts and provides a foundation for developing methods to control the knowledge selection process.
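To make the idea of probing intermediate activations concrete, the following is a minimal sketch of a linear probe, assuming the activations have already been extracted into a matrix with one row per example. It is not the authors' implementation: the data here is synthetic, and the function names, dimensions, and hyperparameters (`make_synthetic_activations`, `train_linear_probe`, `probe_accuracy`, `shift`, `lr`, `steps`) are hypothetical choices made only for illustration.

```python
import numpy as np


def make_synthetic_activations(n=400, d=32, shift=4.0, seed=0):
    """Hypothetical stand-in for residual-stream activations.

    Label 1 plays the role of "conflict" and label 0 of "no conflict".
    Examples with label 1 are shifted along one random unit direction,
    which is an assumption made purely so the toy task is learnable.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    acts = rng.normal(size=(n, d)) + shift * labels[:, None] * direction
    return acts, labels


def train_linear_probe(x, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted P(label = 1)
        grad = p - y                             # gradient of the log loss w.r.t. the logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(x, y, w, b):
    """Fraction of examples whose thresholded probe output matches the label."""
    preds = ((x @ w + b) > 0).astype(int)
    return float((preds == y).mean())


if __name__ == "__main__":
    acts, labels = make_synthetic_activations()
    train_x, test_x = acts[:300], acts[300:]
    train_y, test_y = labels[:300], labels[300:]
    mean = train_x.mean(axis=0)  # centre with training statistics only
    w, b = train_linear_probe(train_x - mean, train_y)
    print("held-out probe accuracy:", probe_accuracy(test_x - mean, test_y, w, b))
```

In a real setting, the synthetic matrix would be replaced by activations collected from a chosen layer of the model, and the labels by whether each prompt contains a knowledge conflict. The probe reads the activations only, so neither the input nor the model parameters are modified.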
