REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models
January 4, 2025
Author: Jian Hu
cs.AI
Abstract
Reinforcement Learning from Human Feedback (RLHF) has emerged as a critical
approach for aligning large language models with human preferences, witnessing
rapid algorithmic evolution through methods such as Proximal Policy
Optimization (PPO), Direct Preference Optimization (DPO), REINFORCE Leave
One-Out (RLOO), ReMax, and Group Relative Policy Optimization (GRPO). We
present REINFORCE++, an enhanced variant of the classical REINFORCE algorithm
that incorporates key optimization techniques from PPO while eliminating the
need for a critic network. REINFORCE++ achieves three primary objectives: (1)
simplicity, (2) enhanced training stability, and (3) reduced computational
overhead. Through extensive empirical evaluation, we demonstrate that
REINFORCE++ exhibits superior stability compared to GRPO and achieves greater
computational efficiency than PPO while maintaining comparable performance. The
implementation is available at https://github.com/OpenRLHF/OpenRLHF.
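To make the abstract's description concrete, the sketch below shows one way a critic-free, PPO-clipped policy-gradient loss of the kind described here could look in PyTorch. The function name `reinforce_pp_loss`, the tensor shapes, and the use of batch-normalized scalar rewards as token-level advantages are illustrative assumptions rather than the official OpenRLHF implementation; consult the linked repository for the actual code.

```python
# Illustrative sketch only (not the official OpenRLHF code): a critic-free
# policy-gradient update combining a PPO-style clipped ratio with a
# normalized, reward-derived advantage, as the abstract describes.
import torch

def reinforce_pp_loss(logprobs, old_logprobs, rewards, action_mask,
                      clip_eps=0.2):
    """Token-level clipped policy-gradient loss without a critic network.

    logprobs, old_logprobs: (batch, seq_len) log-probs of the sampled tokens.
    rewards: (batch,) scalar sequence rewards (e.g., from a reward model,
             assumed already KL-regularized), used directly as returns.
    action_mask: (batch, seq_len) 1 for response tokens, 0 for prompt/padding.
    """
    # Critic-free advantage: normalize scalar rewards across the batch and
    # broadcast the same value to every response token.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    adv = adv.unsqueeze(-1).expand_as(logprobs)

    # PPO-style probability ratio with clipping for training stability.
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv

    # Pessimistic (min) objective, averaged over response tokens only.
    per_token = -torch.min(unclipped, clipped) * action_mask
    return per_token.sum() / action_mask.sum()
```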