RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation
January 15, 2025
Authors: Kaiqu Liang, Haimin Hu, Ryan Liu, Thomas L. Griffiths, Jaime Fernández Fisac
cs.AI
Abstract
Generative AI systems like foundation models (FMs) must align well with human
values to ensure their behavior is helpful and trustworthy. While Reinforcement
Learning from Human Feedback (RLHF) has shown promise for optimizing model
performance using human judgments, existing RLHF pipelines predominantly rely
on immediate feedback, which can fail to accurately reflect the downstream
impact of an interaction on users' utility. We demonstrate that feedback based
on evaluators' foresight estimates of downstream consequences systematically
induces Goodhart's Law dynamics, incentivizing misaligned behaviors like
sycophancy and deception and ultimately degrading user outcomes. To alleviate
this, we propose decoupling evaluation from prediction by refocusing RLHF on
hindsight feedback. Our theoretical analysis reveals that conditioning
evaluator feedback on downstream observations mitigates misalignment and
improves expected human utility, even when these observations are simulated by
the AI system itself. To leverage this insight in a practical alignment
algorithm, we introduce Reinforcement Learning from Hindsight Simulation
(RLHS), which first simulates plausible consequences and then elicits feedback
to assess what behaviors were genuinely beneficial in hindsight. We apply RLHS
to two widely-employed online and offline preference optimization methods --
Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) --
and show empirically that misalignment is significantly reduced with both
methods. Through an online human user study, we show that RLHS consistently
outperforms RLHF in helping users achieve their goals and earns higher
satisfaction ratings, despite being trained solely with simulated hindsight
feedback. These results underscore the importance of focusing on long-term
consequences, even simulated ones, to mitigate misalignment in RLHF.
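To make the described pipeline concrete, below is a minimal, hypothetical sketch of how hindsight-simulated preference data might be collected before running DPO (or before fitting a reward model for PPO). The helper names `simulate_outcome`, `hindsight_preference`, and `collect_rlhs_pairs` are illustrative assumptions, not the authors' implementation; in practice both the outcome simulation and the hindsight evaluation would be performed by querying a language model. The key difference from standard RLHF preference collection is that the evaluator is conditioned on the simulated downstream observations rather than asked to guess consequences in foresight.

```python
# Hypothetical sketch of RLHS-style data collection (assumed helper names,
# placeholder logic; a real pipeline would query an LLM for each step).
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response preferred in hindsight
    rejected: str  # response dispreferred in hindsight


def simulate_outcome(prompt: str, response: str) -> str:
    """Roll the interaction forward: have the AI system itself imagine the
    downstream consequence the user would experience after acting on
    `response`. (Placeholder implementation.)"""
    return f"Simulated downstream outcome of acting on: {response!r}"


def hindsight_preference(prompt: str, a: str, b: str,
                         outcome_a: str, outcome_b: str) -> int:
    """Ask a (simulated) evaluator which response was genuinely helpful,
    conditioning on the simulated outcomes rather than on foresight guesses.
    Returns 0 if `a` is preferred, 1 otherwise. (Placeholder judgment.)"""
    return 0 if len(outcome_a) >= len(outcome_b) else 1


def collect_rlhs_pairs(prompts: List[str],
                       policy: Callable[[str], str]) -> List[PreferencePair]:
    """Build hindsight-labeled preference pairs for offline optimization
    (e.g., DPO), or for fitting a reward model used by PPO."""
    pairs: List[PreferencePair] = []
    for prompt in prompts:
        a, b = policy(prompt), policy(prompt)      # two candidate responses
        oa = simulate_outcome(prompt, a)
        ob = simulate_outcome(prompt, b)
        winner = hindsight_preference(prompt, a, b, oa, ob)
        chosen, rejected = (a, b) if winner == 0 else (b, a)
        pairs.append(PreferencePair(prompt, chosen, rejected))
    return pairs
```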