VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment
October 2, 2024
作者: Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, Nicolas Le Roux
cs.AI
Abstract
Large language models (LLMs) are increasingly applied to complex reasoning
tasks that require executing several complex steps before receiving any reward.
Properly assigning credit to these steps is essential for enhancing model
performance. Proximal Policy Optimization (PPO), a state-of-the-art
reinforcement learning (RL) algorithm used for LLM finetuning, employs value
networks to tackle credit assignment. However, value networks face challenges
in predicting the expected cumulative rewards accurately in complex reasoning
tasks, often leading to high-variance updates and suboptimal performance. In
this work, we systematically evaluate the efficacy of value networks and reveal
their significant shortcomings in reasoning-heavy LLM tasks, showing that they
barely outperform a random baseline when comparing alternative steps. To
address this, we propose VinePPO, a straightforward approach that leverages the
flexibility of language environments to compute unbiased Monte Carlo-based
estimates, bypassing the need for large value networks. Our method consistently
outperforms PPO and other RL-free baselines across the MATH and GSM8K datasets with
fewer gradient updates (up to 9x) and less wall-clock time (up to 3.0x). These
results emphasize the importance of accurate credit assignment in RL finetuning
of LLMs and demonstrate VinePPO's potential as a superior alternative.
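The core idea described above — replacing a learned value network with unbiased Monte Carlo estimates — can be illustrated with a minimal sketch. Because a language environment lets us reset to any intermediate reasoning step and continue generating, the value of a partial solution can be estimated by sampling several independent completions and averaging their terminal rewards. The names below (`mc_value_estimate`, `policy_rollout`, `reward_fn`, and the toy policy/reward) are hypothetical stand-ins, not the paper's actual implementation:

```python
import random

def mc_value_estimate(state, policy_rollout, reward_fn, num_rollouts=9):
    """Unbiased Monte Carlo estimate of V(state): the mean terminal
    reward over independent completions sampled from the current policy.
    `policy_rollout` and `reward_fn` are stand-ins for the LLM policy
    and the task's answer checker (assumptions, not the paper's API)."""
    returns = [reward_fn(policy_rollout(state)) for _ in range(num_rollouts)]
    return sum(returns) / num_rollouts

# Toy illustration: a "policy" that finishes a partial solution at random,
# and a reward of 1.0 when the final answer is correct, else 0.0.
def toy_rollout(state):
    return state + [random.choice(["correct", "wrong"])]

def toy_reward(trajectory):
    return 1.0 if trajectory[-1] == "correct" else 0.0

random.seed(0)
v = mc_value_estimate(["step1", "step2"], toy_rollout, toy_reward,
                      num_rollouts=100)
print(round(v, 2))  # estimate of the success probability from this state
```

Averaging independent rollouts in this way is unbiased by construction (its expectation is exactly the policy's expected return from the state), at the cost of extra generation compute per step — the trade-off the abstract contrasts with a biased but cheap value network.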