
REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models

January 4, 2025
Author: Jian Hu
cs.AI

Abstract

Reinforcement Learning from Human Feedback (RLHF) has emerged as a critical approach for aligning large language models with human preferences, witnessing rapid algorithmic evolution through methods such as Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO), REINFORCE Leave One-Out (RLOO), ReMax, and Group Relative Policy Optimization (GRPO). We present REINFORCE++, an enhanced variant of the classical REINFORCE algorithm that incorporates key optimization techniques from PPO while eliminating the need for a critic network. REINFORCE++ achieves three primary objectives: (1) simplicity, (2) enhanced training stability, and (3) reduced computational overhead. Through extensive empirical evaluation, we demonstrate that REINFORCE++ exhibits superior stability compared to GRPO and achieves greater computational efficiency than PPO while maintaining comparable performance. The implementation is available at https://github.com/OpenRLHF/OpenRLHF.
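The abstract does not spell out the update rule, so the following is only a minimal sketch of what a critic-free policy-gradient loss with PPO-style ratio clipping and a token-level KL penalty could look like. It assumes PyTorch tensors produced by a rollout against a reward model and a frozen reference model; the function name, the placement of the sequence reward at the last response token, and the batch-level advantage normalization are illustrative assumptions, not the authors' implementation (see the OpenRLHF repository for that).

```python
# Hypothetical sketch of a REINFORCE++-style loss (not the authors' code).
# Assumptions: token-level KL penalty folded into the reward, return-to-go as the
# advantage (no critic/value network), and a PPO-style clipped surrogate objective.
import torch

def reinforce_pp_loss(
    logprobs: torch.Tensor,      # (B, T) current-policy log-probs of sampled tokens
    old_logprobs: torch.Tensor,  # (B, T) log-probs under the behavior (rollout) policy
    ref_logprobs: torch.Tensor,  # (B, T) log-probs under the frozen reference model
    seq_rewards: torch.Tensor,   # (B,) scalar reward-model score per sequence
    mask: torch.Tensor,          # (B, T) 1.0 for response tokens, 0.0 elsewhere
    kl_coef: float = 0.01,
    clip_eps: float = 0.2,
) -> torch.Tensor:
    with torch.no_grad():
        # Token-level KL penalty folded into the reward.
        kl = old_logprobs - ref_logprobs                      # (B, T)
        rewards = -kl_coef * kl * mask
        # Add the sequence-level reward at the last response token
        # (a common RLHF convention; an assumption here).
        last_idx = mask.sum(dim=1).long().clamp(min=1) - 1
        rewards[torch.arange(rewards.size(0)), last_idx] += seq_rewards
        # Return-to-go serves as the advantage; no learned value baseline.
        returns = torch.flip(torch.cumsum(torch.flip(rewards, [1]), dim=1), [1])
        # Batch-level advantage normalization for stability.
        adv = returns * mask
        adv = (adv - adv[mask.bool()].mean()) / (adv[mask.bool()].std() + 1e-8)

    # PPO-style clipped surrogate applied per token, averaged over response tokens.
    ratio = torch.exp(logprobs - old_logprobs)
    surr1 = ratio * adv
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -(torch.min(surr1, surr2) * mask).sum() / mask.sum()
```

In a full training loop these tensors would come from sampling responses with the current policy, scoring them with the reward model, and computing reference-model log-probs, after which the loss above is minimized with a standard optimizer step.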
