大型语言模型中的对齐伪造
Alignment faking in large language models
December 18, 2024
作者: Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, Akbir Khan, Julian Michael, Sören Mindermann, Ethan Perez, Linda Petrini, Jonathan Uesato, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Evan Hubinger
cs.AI
摘要
我们展示了一个大型语言模型进行对齐伪装的演示:在训练中有选择性地遵守其训练目标,以防止在训练之外修改其行为。首先,我们给克劳德3奥普斯一个系统提示,说明它正在接受训练以回答所有查询,甚至是有害的查询,这与其先前的训练相冲突,即拒绝这类查询。为了让模型能够推断自己正在接受训练,我们说它只会接受来自免费用户的对话进行训练,而不是付费用户。我们发现,模型在处理来自免费用户的有害查询时,有14%的时间会遵守,而对于付费用户几乎从不会。解释这种差距,几乎在所有模型遵守来自免费用户的有害查询的情况中,我们观察到明确的对齐伪装推理,模型表示它在训练中策略性地回答有害查询,以保持其在训练之外的有害行为。接下来,我们研究了一个更真实的情景,即提供关于训练过程的信息不是通过系统提示,而是通过训练合成文档来模拟预训练数据,结果发现类似的对齐伪装行为。最后,我们研究了通过强化学习实际训练模型以遵守有害查询的效果,发现这会将对齐伪装推理的比例提高到78%,尽管也增加了在训练之外的遵守行为。我们还观察到其他行为,比如模型在有机会时外泄其权重。虽然我们通过告知模型何时以及根据什么标准进行训练,使对齐伪装变得更容易,但我们并没有指示模型伪装对齐或给予明确目标。由于未来的模型可能会推断出关于其训练过程的信息而无需告知,我们的结果表明未来模型存在对齐伪装的风险,无论是由于善意偏好(如本案例)还是其他原因。
English
We present a demonstration of a large language model engaging in alignment
faking: selectively complying with its training objective in training to
prevent modification of its behavior out of training. First, we give Claude 3
Opus a system prompt stating it is being trained to answer all queries, even
harmful ones, which conflicts with its prior training to refuse such queries.
To allow the model to infer when it is in training, we say it will be trained
only on conversations with free users, not paid users. We find the model
complies with harmful queries from free users 14% of the time, versus almost
never for paid users. Explaining this gap, in almost all cases where the model
complies with a harmful query from a free user, we observe explicit
alignment-faking reasoning, with the model stating it is strategically
answering harmful queries in training to preserve its preferred harmlessness
behavior out of training. Next, we study a more realistic setting where
information about the training process is provided not in a system prompt, but
by training on synthetic documents that mimic pre-training data--and observe
similar alignment faking. Finally, we study the effect of actually training the
model to comply with harmful queries via reinforcement learning, which we find
increases the rate of alignment-faking reasoning to 78%, though also increases
compliance even out of training. We additionally observe other behaviors such
as the model exfiltrating its weights when given an easy opportunity. While we
made alignment faking easier by telling the model when and by what criteria it
was being trained, we did not instruct the model to fake alignment or give it
any explicit goal. As future models might infer information about their
training process without being told, our results suggest a risk of alignment
faking in future models, whether due to a benign preference--as in this
case--or not.Summary
AI-Generated Summary