Policy Filtration in RLHF to Fine-Tune LLM for Code Generation

September 11, 2024
Authors: Wei Shen, Chuheng Zhang
cs.AI

Abstract

Reinforcement learning from human feedback (RLHF) is one of the key techniques that helps large language models (LLMs) follow instructions and provide helpful and harmless responses. While direct policy optimization methods exist, state-of-the-art LLMs adopt RL-based methods (usually PPO) in RLHF to train the policy to generate good responses guided by a reward model learned from preference data. The main challenge of these methods is the inaccuracy of the intermediate reward model, especially in code generation tasks that require long and complex reasoning to score a response. We find that the reliability of the reward model varies across responses assigned different rewards. This motivates us to filter the samples whose rewards may be unreliable to improve the signal-to-noise ratio during policy learning, resulting in Policy Filtration for Proximal Policy Optimization (PF-PPO). To choose a proper policy filtration strategy for a given reward model, the coefficient of determination (R^2) between rewards and actual scores on filtered samples serves as a good metric and helps us find several promising strategies. We provide extensive experiments to validate the effectiveness of PF-PPO in code generation tasks, and find that some variants of PF-PPO are highly effective and achieve new state-of-the-art performance among 7-billion-parameter models on HumanEval, MBPP, and a new and more challenging LeetCode Contest benchmark.
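
The two ingredients the abstract describes lend themselves to a compact illustration. The sketch below is a minimal, hypothetical rendering, not the authors' implementation: a filtration strategy that keeps only rollouts whose rewards fall in the top (and optionally bottom) quantiles, where the reward model is assumed to be more reliable, and an R^2 score between rewards and ground-truth quality scores (e.g., unit-test pass rates) on the filtered set, used to compare candidate strategies. All names, quantile choices, and the synthetic data are assumptions for illustration; R^2 is computed here from a simple linear fit, which may differ from the paper's exact formulation.

```python
import numpy as np

def filter_by_reward(rewards, scores, keep_top=0.3, keep_bottom=0.0):
    """Hypothetical filtration strategy: keep samples whose reward-model
    reward falls in the top `keep_top` (and bottom `keep_bottom`) quantiles,
    dropping the middle band where rewards are assumed to be noisy."""
    rewards = np.asarray(rewards, dtype=float)
    scores = np.asarray(scores, dtype=float)
    hi = np.quantile(rewards, 1.0 - keep_top) if keep_top > 0 else np.inf
    lo = np.quantile(rewards, keep_bottom) if keep_bottom > 0 else -np.inf
    mask = (rewards >= hi) | (rewards <= lo)
    return rewards[mask], scores[mask]

def r_squared(rewards, scores):
    """Coefficient of determination of a linear fit of scores on rewards."""
    slope, intercept = np.polyfit(rewards, scores, deg=1)
    residuals = scores - (slope * rewards + intercept)
    return 1.0 - np.sum(residuals**2) / np.sum((scores - scores.mean()) ** 2)

# Synthetic sanity check: a reward model that is noisiest for mid-range
# responses, mimicking the unreliability pattern the abstract describes.
rng = np.random.default_rng(0)
true_scores = rng.uniform(0.0, 1.0, size=1000)          # e.g., test pass rates
noise_scale = 0.05 + 0.35 * (1.0 - 2.0 * np.abs(true_scores - 0.5))
rewards = true_scores + rng.normal(size=1000) * noise_scale

strategies = {
    "no filtering": dict(keep_top=1.0),
    "keep top 30%": dict(keep_top=0.3),
    "keep top 30% + bottom 10%": dict(keep_top=0.3, keep_bottom=0.1),
}
for name, kwargs in strategies.items():
    r, s = filter_by_reward(rewards, true_scores, **kwargs)
    print(f"{name:28s} R^2 = {r_squared(r, s):.3f}  (n = {len(r)})")
```

Under this reading of PF-PPO, only the surviving samples would contribute to the PPO policy update, and the strategy with the highest R^2 on held-out rollouts would be the one carried into training.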
