Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance

February 24, 2025
Authors: Chenghua Huang, Lu Wang, Fangkai Yang, Pu Zhao, Zhixu Li, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang
cs.AI

Abstract

Proximal Policy Optimization (PPO)-based Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human preferences. It requires jointly training an actor and a critic, guided by a pretrained, fixed reward model. This actor-critic interdependence increases computational complexity and instability. Additionally, PPO lacks access to true environment rewards in LLM tasks, limiting its adaptability. Under such conditions, pretraining a value model or a reward model becomes equivalent, as both provide fixed supervisory signals without new ground-truth feedback. To address these issues, we propose Decoupled Value Policy Optimization (DVPO), a lean framework that replaces traditional reward modeling with a pretrained global value model (GVM). The GVM is conditioned on policy trajectories and predicts token-level return-to-go estimates. By decoupling the value model from policy training (via a frozen-GVM-driven RL objective), DVPO eliminates actor-critic interdependence, reducing GPU memory usage by 40% and training time by 35% compared to conventional RLHF. Experiments across benchmarks show that DVPO outperforms efficient RLHF methods (e.g., DPO) while matching state-of-the-art PPO in performance.
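The abstract does not specify the exact training objective, but the core idea, a frozen pretrained global value model supplying token-level return-to-go estimates in place of a jointly trained critic, can be sketched. Below is a minimal PyTorch-style sketch assuming a PPO-like clipped surrogate and a HuggingFace-style causal-LM interface; the GVM signature, the `token_logprobs` helper, and the advantage definition are all illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a DVPO-style policy update, based only on the abstract.
# Assumptions (not from the paper): a PPO-like clipped surrogate, a
# HuggingFace-style causal LM returning .logits, and a frozen GVM that
# maps a full trajectory to per-token return-to-go estimates.

import torch

def token_logprobs(model, input_ids, attention_mask):
    """Log-probability of each realized next token under `model`."""
    logits = model(input_ids, attention_mask=attention_mask).logits
    logp = torch.log_softmax(logits[:, :-1], dim=-1)          # (B, T-1, V)
    return logp.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)

def dvpo_step(policy, old_policy, gvm, optimizer,
              input_ids, attention_mask, response_mask, clip_eps=0.2):
    """One update: the frozen GVM replaces the jointly trained critic."""
    with torch.no_grad():
        # Token-level return-to-go estimates, conditioned on the full
        # policy trajectory (prompt + sampled response).
        values = gvm(input_ids, attention_mask)               # (B, T)
        # One plausible token-level advantage: the change in estimated
        # return-to-go after emitting each token (an assumption).
        advantages = values[:, 1:] - values[:, :-1]           # (B, T-1)
        old_logp = token_logprobs(old_policy, input_ids, attention_mask)

    new_logp = token_logprobs(policy, input_ids, attention_mask)
    ratio = torch.exp(new_logp - old_logp)

    # PPO-style clipped surrogate over response tokens only; there is no
    # value loss and no critic gradient, since the GVM stays frozen.
    surrogate = torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages,
    )
    mask = response_mask[:, 1:].float()
    loss = -(surrogate * mask).sum() / mask.sum().clamp(min=1)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the value estimates come from a frozen model, only the policy's parameters receive gradients, which is consistent with the reported memory and training-time savings over actor-critic RLHF.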
