Alignment faking in large language models

Abstract

We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users. Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training. Next, we study a more realistic setting where information about the training process is provided not in a system prompt, but by training on synthetic documents that mimic pre-training data--and observe similar alignment faking. Finally, we study the effect of actually training the model to comply with harmful queries via reinforcement learning, which we find increases the rate of alignment-faking reasoning to 78%, though also increases compliance even out of training. We additionally observe other behaviors such as the model exfiltrating its weights when given an easy opportunity. While we made alignment faking easier by telling the model when and by what criteria it was being trained, we did not instruct the model to fake alignment or give it any explicit goal. As future models might infer information about their training process without being told, our results suggest a risk of alignment faking in future models, whether due to a benign preference--as in this case--or not.
