RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation

January 15, 2025
Authors: Kaiqu Liang, Haimin Hu, Ryan Liu, Thomas L. Griffiths, Jaime Fernández Fisac
cs.AI

Abstract

Generative AI systems like foundation models (FMs) must align well with human values to ensure their behavior is helpful and trustworthy. While Reinforcement Learning from Human Feedback (RLHF) has shown promise for optimizing model performance using human judgments, existing RLHF pipelines predominantly rely on immediate feedback, which can fail to accurately reflect the downstream impact of an interaction on users' utility. We demonstrate that feedback based on evaluators' foresight estimates of downstream consequences systematically induces Goodhart's Law dynamics, incentivizing misaligned behaviors like sycophancy and deception and ultimately degrading user outcomes. To alleviate this, we propose decoupling evaluation from prediction by refocusing RLHF on hindsight feedback. Our theoretical analysis reveals that conditioning evaluator feedback on downstream observations mitigates misalignment and improves expected human utility, even when these observations are simulated by the AI system itself. To leverage this insight in a practical alignment algorithm, we introduce Reinforcement Learning from Hindsight Simulation (RLHS), which first simulates plausible consequences and then elicits feedback to assess what behaviors were genuinely beneficial in hindsight. We apply RLHS to two widely-employed online and offline preference optimization methods -- Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) -- and show empirically that misalignment is significantly reduced with both methods. Through an online human user study, we show that RLHS consistently outperforms RLHF in helping users achieve their goals and earns higher satisfaction ratings, despite being trained solely with simulated hindsight feedback. These results underscore the importance of focusing on long-term consequences, even simulated ones, to mitigate misalignment in RLHF.
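The abstract describes RLHS as a two-step loop: first simulate plausible downstream consequences of an interaction, then elicit the evaluator's rating in hindsight, conditioned on those simulated observations, rather than on an immediate impression. Below is a minimal sketch of that loop under stated assumptions; the names (Interaction, simulate_rollout, elicit_hindsight_feedback, rlhs_reward) and the toy scoring rule are illustrative placeholders, not the authors' implementation.

```python
# Minimal sketch of the RLHS hindsight-feedback loop described in the abstract.
# All function names and the scoring rule are hypothetical placeholders.

from dataclasses import dataclass
from typing import List


@dataclass
class Interaction:
    prompt: str
    response: str


def simulate_rollout(interaction: Interaction, horizon: int) -> List[str]:
    """Roll the interaction forward to produce simulated downstream consequences
    (e.g., what happens after the user acts on the model's advice).
    A real system would generate these with the foundation model itself."""
    return [
        f"simulated outcome {t} of acting on: {interaction.response!r}"
        for t in range(horizon)
    ]


def elicit_hindsight_feedback(interaction: Interaction, outcomes: List[str]) -> float:
    """Rate the interaction *after* seeing the simulated consequences,
    decoupling evaluation from prediction. Here a toy rule stands in for
    the human or simulated evaluator's rating."""
    return 0.0 if any("misleading" in o for o in outcomes) else 1.0


def rlhs_reward(interaction: Interaction, horizon: int = 3) -> float:
    """RLHS reward signal: hindsight rating conditioned on simulated downstream
    observations. Standard RLHF would instead rate the interaction immediately,
    without any rollout."""
    outcomes = simulate_rollout(interaction, horizon)
    return elicit_hindsight_feedback(interaction, outcomes)


if __name__ == "__main__":
    demo = Interaction(
        prompt="Which laptop should I buy?",
        response="Option A has the battery life you asked for.",
    )
    print("hindsight reward:", rlhs_reward(demo))
```

In the setup the abstract outlines, a signal like `rlhs_reward` would then drive standard preference optimization: as the scalar reward in PPO, or by ranking pairs of responses to build the preference dataset consumed by DPO.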
