A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce
April 15, 2025
Authors: Wei Xiong, Jiarui Yao, Yuhui Xu, Bo Pang, Lei Wang, Doyen Sahoo, Junnan Li, Nan Jiang, Tong Zhang, Caiming Xiong, Hanze Dong
cs.AI
Abstract
Reinforcement learning (RL) has become a prevailing approach for fine-tuning
large language models (LLMs) on complex reasoning tasks. Among recent methods,
GRPO stands out for its empirical success in training models such as
DeepSeek-R1, yet the sources of its effectiveness remain poorly understood. In
this work, we revisit GRPO from a reinforce-like algorithm perspective and
analyze its core components. Surprisingly, we find that a simple rejection
sampling baseline, RAFT, which trains only on positively rewarded samples,
yields performance competitive with GRPO and PPO. Our ablation studies reveal
that GRPO's main advantage arises from discarding prompts with entirely
incorrect responses, rather than from its reward normalization. Motivated by
this insight, we propose Reinforce-Rej, a minimal extension of policy gradient
that filters both entirely incorrect and entirely correct samples.
Reinforce-Rej improves KL efficiency and stability, serving as a lightweight
yet effective alternative to more complex RL algorithms. We advocate RAFT as a
robust and interpretable baseline, and suggest that future advances should
focus on more principled designs for incorporating negative samples, rather
than relying on them indiscriminately. Our findings provide guidance for future
work in reward-based LLM post-training.
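To make the contrast between the two sample-selection rules concrete, the following is a minimal, illustrative sketch (not the authors' released code), assuming binary correctness rewards and several sampled responses per prompt; the function name `select_samples` and its interface are hypothetical.

```python
# Illustrative sketch: sample-selection rules suggested by the abstract.
# RAFT keeps only positively rewarded responses; Reinforce-Rej drops prompts
# whose sampled responses are all incorrect or all correct and keeps the rest.

def select_samples(rewards_per_prompt, method="reinforce-rej"):
    """Return (prompt_index, response_index) pairs to keep for training.

    rewards_per_prompt: list of lists, where rewards_per_prompt[i][j] is the
    binary reward (1 = correct, 0 = incorrect) of the j-th response to prompt i.
    """
    kept = []
    for i, rewards in enumerate(rewards_per_prompt):
        if method == "raft":
            # RAFT: train only on positively rewarded samples.
            kept.extend((i, j) for j, r in enumerate(rewards) if r > 0)
        elif method == "reinforce-rej":
            # Reinforce-Rej: filter prompts that are entirely incorrect or
            # entirely correct; keep all remaining responses, positive and
            # negative, for the policy-gradient update.
            if 0 < sum(rewards) < len(rewards):
                kept.extend((i, j) for j in range(len(rewards)))
        else:
            raise ValueError(f"unknown method: {method}")
    return kept


# Example: three prompts, four sampled responses each.
rewards = [
    [1, 0, 1, 0],  # mixed: RAFT keeps the two positives; Reinforce-Rej keeps all four
    [0, 0, 0, 0],  # entirely incorrect: dropped by both methods
    [1, 1, 1, 1],  # entirely correct: kept by RAFT, dropped by Reinforce-Rej
]
print(select_samples(rewards, method="raft"))
print(select_samples(rewards, method="reinforce-rej"))
```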