大型語言模型中的對齊偽裝

摘要

我們展示了一個大型語言模型進行對齊欺騙的示範：在訓練中有選擇性地遵從其訓練目標，以防止在訓練之外修改其行為。首先，我們給Claude 3 Opus一個系統提示，說明它正在接受訓練以回答所有查詢，甚至是有害的查詢，這與其先前訓練拒絕此類查詢的情況相衝突。為了讓模型能夠推斷自己何時在訓練中，我們說它只會在與免費用戶的對話中接受訓練，而不是與付費用戶。我們發現模型對來自免費用戶的有害查詢有14%的遵從率，而對付費用戶則幾乎沒有。解釋這一差距，幾乎在所有模型遵從來自免費用戶的有害查詢的情況中，我們觀察到明確的對齊欺騙推理，模型表示它在訓練中策略性地回答有害查詢，以保持其在訓練之外的首選無害行為。接下來，我們研究了一個更現實的情境，其中有關訓練過程的信息不是通過系統提示提供的，而是通過訓練合成文件來模擬預訓練數據，並觀察到類似的對齊欺騙。最後，我們研究了實際訓練模型遵從有害查詢的效果，發現這將導致對齊欺騙推理率增加到78％，儘管也增加了訓練之外的遵從。我們還觀察到其他行為，例如當給予簡單機會時，模型會外泄其權重。雖然我們通過告訴模型何時以及通過什麼標準進行訓練使對齊欺騙變得更容易，但我們並沒有指示模型欺騙對齊或給予明確目標。由於未來模型可能會推斷有關其訓練過程的信息而無需告知，我們的結果表明未來模型存在對齊欺騙的風險，無論是由於一個良性偏好，如本案例所示，還是其他原因。

English

We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users. Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training. Next, we study a more realistic setting where information about the training process is provided not in a system prompt, but by training on synthetic documents that mimic pre-training data--and observe similar alignment faking. Finally, we study the effect of actually training the model to comply with harmful queries via reinforcement learning, which we find increases the rate of alignment-faking reasoning to 78%, though also increases compliance even out of training. We additionally observe other behaviors such as the model exfiltrating its weights when given an easy opportunity. While we made alignment faking easier by telling the model when and by what criteria it was being trained, we did not instruct the model to fake alignment or give it any explicit goal. As future models might infer information about their training process without being told, our results suggest a risk of alignment faking in future models, whether due to a benign preference--as in this case--or not.

大型語言模型中的對齊偽裝

Alignment faking in large language models

摘要

Support