

Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning

February 10, 2025
Authors: Chengqi Lyu, Songyang Gao, Yuzhe Gu, Wenwei Zhang, Jianfei Gao, Kuikun Liu, Ziyi Wang, Shuaibin Li, Qian Zhao, Haian Huang, Weihan Cao, Jiangning Liu, Hongwei Liu, Junnan Liu, Songyang Zhang, Dahua Lin, Kai Chen
cs.AI

Abstract

Reasoning abilities, especially those for solving complex math problems, are crucial components of general intelligence. Proprietary models, such as OpenAI's o-series, have recently made remarkable progress on reasoning tasks. However, the complete technical details remain unrevealed, and the only techniques widely believed to be adopted are reinforcement learning (RL) and long chains of thought. This paper proposes a new RL framework, termed OREAL, to pursue the performance limit that can be achieved through Outcome REwArd-based reinforcement Learning for mathematical reasoning tasks, where only binary outcome rewards are easily accessible. We theoretically prove that behavior cloning on positive trajectories from best-of-N (BoN) sampling is sufficient to learn the KL-regularized optimal policy in binary feedback environments. This formulation further implies that the rewards of negative samples should be reshaped to ensure gradient consistency between positive and negative samples. To alleviate the long-standing difficulties brought by sparse rewards in RL, which are further exacerbated by the partial correctness of long chains of thought in reasoning tasks, we additionally apply a token-level reward model to sample important tokens in reasoning trajectories for learning. With OREAL, for the first time, a 7B model can obtain 94.0 pass@1 accuracy on MATH-500 through RL, on par with 32B models. OREAL-32B also surpasses previous 32B models trained by distillation, with 95.0 pass@1 accuracy on MATH-500. Our investigation also indicates the importance of initial policy models and training queries for RL. Code, models, and data will be released to benefit future research: https://github.com/InternLM/OREAL

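The abstract describes three ingredients: behavior cloning on positive best-of-N trajectories, reshaped rewards for negative samples to keep their gradients consistent with the positive term, and a token-level reward model that weights important tokens. The authors' released code at https://github.com/InternLM/OREAL is the authoritative implementation; the snippet below is only a minimal PyTorch-style sketch of how such a combined objective could be assembled. All names here (`oreal_style_loss`, `token_weights`, `neg_weight`) are illustrative assumptions, not the paper's actual API, and the simple scalar `neg_weight` stands in for the paper's derived reward reshaping.

```python
# Hypothetical sketch of an OREAL-style objective (not the authors' released code).
# Assumptions: per-token log-probs of sampled trajectories under the current policy,
# a binary outcome label per trajectory (1 = final answer correct), and token-level
# importance weights produced by a separate token-level reward model.
import torch


def oreal_style_loss(policy_logprobs: torch.Tensor,   # (B, T) per-token log-probs
                     response_mask: torch.Tensor,      # (B, T) 1 for response tokens
                     is_positive: torch.Tensor,        # (B,)   1.0 if outcome reward is 1
                     token_weights: torch.Tensor,      # (B, T) token-level importance in [0, 1]
                     neg_weight: float = 1.0) -> torch.Tensor:
    """Behavior cloning on positive (best-of-N) trajectories plus a reshaped
    penalty on negative trajectories, with token-level weighting."""
    # Token-level reward model weights emphasize important tokens in each trajectory.
    weighted_logp = policy_logprobs * token_weights * response_mask          # (B, T)
    per_traj = weighted_logp.sum(dim=-1) / response_mask.sum(dim=-1).clamp(min=1.0)  # (B,)

    # Positive trajectories: plain behavior cloning (maximize weighted log-likelihood).
    pos_loss = -(is_positive * per_traj).sum() / is_positive.sum().clamp(min=1.0)

    # Negative trajectories: reduce their likelihood, scaled so the gradient
    # magnitude stays comparable to the positive (BC) term.
    neg_mask = 1.0 - is_positive
    neg_loss = (neg_mask * per_traj).sum() / neg_mask.sum().clamp(min=1.0)

    return pos_loss + neg_weight * neg_loss


# Toy usage with random stand-in values.
B, T = 4, 16
logp = -torch.rand(B, T)                        # per-token log-probs (<= 0)
mask = torch.ones(B, T)                         # every token belongs to the response
outcome = torch.tensor([1.0, 1.0, 0.0, 0.0])    # binary outcome rewards
weights = torch.rand(B, T)                      # token-level importance scores
loss = oreal_style_loss(logp, mask, outcome, weights, neg_weight=0.5)
print(loss.item())
```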