REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models

Abstract

Reinforcement Learning from Human Feedback (RLHF) has emerged as a critical approach for aligning large language models with human preferences, and its algorithms have evolved rapidly through methods such as Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO), REINFORCE Leave-One-Out (RLOO), ReMax, and Group Relative Policy Optimization (GRPO). We present REINFORCE++, an enhanced variant of the classical REINFORCE algorithm that incorporates key optimization techniques from PPO while eliminating the need for a critic network. REINFORCE++ achieves three primary objectives: (1) simplicity, (2) enhanced training stability, and (3) reduced computational overhead. Through extensive empirical evaluation, we demonstrate that REINFORCE++ exhibits superior stability compared to GRPO and achieves greater computational efficiency than PPO while maintaining comparable performance. The implementation is available at https://github.com/OpenRLHF/OpenRLHF.
