REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models

Abstract

Reinforcement Learning from Human Feedback (RLHF) has emerged as a critical approach for aligning large language models with human preferences, and it has evolved rapidly through methods such as Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO), REINFORCE Leave One-Out (RLOO), ReMax, and Group Relative Policy Optimization (GRPO). We present REINFORCE++, an enhanced variant of the classical REINFORCE algorithm that incorporates key optimization techniques from PPO while eliminating the need for a critic network. REINFORCE++ achieves three primary objectives: (1) simplicity, (2) enhanced training stability, and (3) reduced computational overhead. Through extensive empirical evaluation, we demonstrate that REINFORCE++ exhibits superior stability compared to GRPO and achieves greater computational efficiency than PPO while maintaining comparable performance. The implementation is available at https://github.com/OpenRLHF/OpenRLHF.

