Self-Rewarding Correction for Mathematical Reasoning
February 26, 2025
Authors: Wei Xiong, Hanning Zhang, Chenlu Ye, Lichang Chen, Nan Jiang, Tong Zhang
cs.AI
Abstract
We study self-rewarding reasoning large language models (LLMs), which can
simultaneously generate step-by-step reasoning and evaluate the correctness of
their outputs at inference time, without external feedback. This
integrated approach allows a single model to independently guide its reasoning
process, offering computational advantages for model deployment. We
particularly focus on the representative task of self-correction, where models
autonomously detect errors in their responses, revise outputs, and decide when
to terminate iterative refinement loops. To enable this, we propose a
two-stage algorithmic framework for constructing self-rewarding reasoning
models using only self-generated data. In the first stage, we employ sequential
rejection sampling to synthesize long chain-of-thought trajectories that
incorporate both self-rewarding and self-correction mechanisms. Fine-tuning
models on these curated data allows them to learn the patterns of
self-rewarding and self-correction. In the second stage, we further enhance the
models' ability to assess response accuracy and refine outputs through
reinforcement learning with rule-based signals. Experiments with Llama-3 and
Qwen-2.5 demonstrate that our approach surpasses intrinsic self-correction
capabilities and achieves performance comparable to systems that rely on
external reward models.
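
The self-correction loop described in the abstract (generate, self-evaluate, revise, decide when to stop) can be illustrated with a minimal Python sketch. This is not the authors' implementation. In the paper a single model produces the whole trajectory, whereas here the three roles are split into hypothetical callables (`generate`, `self_evaluate`, `revise`) purely for readability. The `max_rounds` cap and the exact-match `rule_based_reward` are likewise assumptions, standing in for whatever rule-based signal the second stage actually uses.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (not from the paper): one LLM would play all three roles.
Generate = Callable[[str], str]            # prompt -> initial answer
SelfEvaluate = Callable[[str, str], bool]  # (prompt, answer) -> does the model accept it?
Revise = Callable[[str, str], str]         # (prompt, rejected answer) -> revised answer


def self_rewarding_correction(
    prompt: str,
    generate: Generate,
    self_evaluate: SelfEvaluate,
    revise: Revise,
    max_rounds: int = 3,  # assumed cap on refinement rounds
) -> Tuple[str, List[str]]:
    """Run generate -> self-evaluate -> revise until the model accepts its answer."""
    answer = generate(prompt)
    history = [answer]
    for _ in range(max_rounds):
        # The model's own verdict decides whether to terminate; no external reward model.
        if self_evaluate(prompt, answer):
            break
        answer = revise(prompt, answer)
        history.append(answer)
    return answer, history


def rule_based_reward(final_answer: str, reference: str) -> float:
    """Assumed rule-based signal: 1.0 on exact match with the reference answer, else 0.0."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0
```

With stub callables in which the first attempt is wrong and the self-evaluation rejects it, the loop revises once and then stops as soon as the model accepts its own answer. Keeping evaluation inside the same loop is what removes the need for a separate reward model at deployment time.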