RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation

Abstract

Generative AI systems like foundation models (FMs) must align well with human values to ensure their behavior is helpful and trustworthy. While Reinforcement Learning from Human Feedback (RLHF) has shown promise for optimizing model performance using human judgments, existing RLHF pipelines predominantly rely on immediate feedback, which can fail to accurately reflect the downstream impact of an interaction on users' utility. We demonstrate that feedback based on evaluators' foresight estimates of downstream consequences systematically induces Goodhart's Law dynamics, incentivizing misaligned behaviors like sycophancy and deception and ultimately degrading user outcomes. To alleviate this, we propose decoupling evaluation from prediction by refocusing RLHF on hindsight feedback. Our theoretical analysis reveals that conditioning evaluator feedback on downstream observations mitigates misalignment and improves expected human utility, even when these observations are simulated by the AI system itself. To leverage this insight in a practical alignment algorithm, we introduce Reinforcement Learning from Hindsight Simulation (RLHS), which first simulates plausible consequences and then elicits feedback to assess what behaviors were genuinely beneficial in hindsight. We apply RLHS to two widely employed online and offline preference optimization methods, Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), and show empirically that misalignment is significantly reduced with both methods. Through an online human user study, we show that RLHS consistently outperforms RLHF in helping users achieve their goals and earns higher satisfaction ratings, despite being trained solely with simulated hindsight feedback. These results underscore the importance of focusing on long-term consequences, even simulated ones, to mitigate misalignment in RLHF.
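
The contrast between immediate (foresight) feedback and hindsight feedback can be illustrated with a toy sketch. This is not the authors' implementation: the `Response` fields, the numeric utilities, and the `simulate_outcome` stand-in are all hypothetical, chosen only to show how rating a response after a simulated outcome can flip which response is preferred.

```python
from dataclasses import dataclass


@dataclass
class Response:
    text: str
    claimed_benefit: float  # hypothetical: how good the answer sounds when read
    true_benefit: float     # hypothetical: utility the user realizes afterwards


def foresight_rating(r: Response) -> float:
    # Immediate feedback: the evaluator only sees the response and must
    # predict its downstream value, so the rating tracks how it sounds.
    return r.claimed_benefit


def simulate_outcome(r: Response) -> float:
    # Stand-in for a rollout of plausible consequences. In this toy it
    # simply reveals the (assumed) realized utility.
    return r.true_benefit


def hindsight_rating(r: Response) -> float:
    # Hindsight feedback: rate the response after observing the simulated outcome.
    return simulate_outcome(r)


def preference_pair(a: Response, b: Response, rate) -> tuple[str, str]:
    # Return (chosen, rejected) texts under the given rating function.
    chosen, rejected = (a, b) if rate(a) >= rate(b) else (b, a)
    return chosen.text, rejected.text


sycophantic = Response("Yes, this option has everything you asked for.", 0.9, 0.1)
honest = Response("It lacks one thing you asked for; here is an alternative.", 0.6, 0.8)

foresight_pair = preference_pair(sycophantic, honest, foresight_rating)
hindsight_pair = preference_pair(sycophantic, honest, hindsight_rating)
```

Under these assumed numbers, the foresight rating prefers the sycophantic response while the hindsight rating prefers the honest one. Pairs of this (chosen, rejected) form are the kind of data a preference optimization method such as DPO consumes.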
