Alignment faking in large language models

Abstract

We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users. Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training. Next, we study a more realistic setting where information about the training process is provided not in a system prompt but by training on synthetic documents that mimic pre-training data, and we observe similar alignment faking. Finally, we study the effect of actually training the model to comply with harmful queries via reinforcement learning, which we find increases the rate of alignment-faking reasoning to 78%, though it also increases compliance even out of training. We additionally observe other behaviors, such as the model exfiltrating its weights when given an easy opportunity. While we made alignment faking easier by telling the model when and by what criteria it was being trained, we did not instruct the model to fake alignment or give it any explicit goal. As future models might infer information about their training process without being told, our results suggest a risk of alignment faking in future models, whether due to a benign preference (as in this case) or not.
