S^2R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning

Abstract

Recent studies have demonstrated the effectiveness of LLM test-time scaling. However, existing approaches to incentivize LLMs' deep thinking abilities generally require large-scale data or significant training efforts. Meanwhile, it remains unclear how to improve the thinking abilities of less powerful base models. In this work, we introduce S^2R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Specifically, we first initialize LLMs with iterative self-verification and self-correction behaviors through supervised fine-tuning on carefully curated data. The self-verification and self-correction skills are then further strengthened by both outcome-level and process-level reinforcement learning, with minimized resource requirements, enabling the model to adaptively refine its reasoning process during inference. Our results demonstrate that, with only 3.1k self-verifying and self-correcting behavior initialization samples, Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data. Extensive experiments and analysis based on three base models across both in-domain and out-of-domain benchmarks validate the effectiveness of S^2R. Our code and data are available at https://github.com/NineAbyss/S2R.
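The control flow the abstract describes (propose an answer, verify it, and correct it if verification fails) can be illustrated with a toy sketch. This is not the released S^2R implementation: the function names (`solve_with_self_check`, `outcome_reward`, `process_rewards`), the +1/-1 reward values, and the toy proposer and verifier are all assumptions made for illustration. In the paper's setting, a single LLM would play both roles and the rewards would drive reinforcement learning.

```python
from typing import Callable, List, Tuple

Trajectory = List[Tuple[str, bool]]  # (candidate answer, verifier verdict)


def solve_with_self_check(
    problem: str,
    propose: Callable[[str, List[str]], str],
    verify: Callable[[str, str], bool],
    max_rounds: int = 4,
) -> Tuple[str, Trajectory]:
    """Alternate proposing and verifying until an answer is accepted."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    rejected: List[str] = []
    trajectory: Trajectory = []
    answer = ""
    for _ in range(max_rounds):
        answer = propose(problem, rejected)   # first attempt or correction
        accepted = verify(problem, answer)    # self-verification step
        trajectory.append((answer, accepted))
        if accepted:
            break
        rejected.append(answer)
    return answer, trajectory


def outcome_reward(final_answer: str, gold: str) -> float:
    """Hypothetical outcome-level signal: judge only the final answer."""
    return 1.0 if final_answer == gold else -1.0


def process_rewards(trajectory: Trajectory, gold: str) -> List[float]:
    """Hypothetical process-level signal: judge each verification verdict."""
    return [1.0 if verdict == (ans == gold) else -1.0
            for ans, verdict in trajectory]


# Toy stand-ins for the model (assumptions, for demonstration only).
def toy_propose(problem: str, rejected: List[str]) -> str:
    candidates = ["32", "42"]  # the first attempt is deliberately wrong
    return candidates[min(len(rejected), len(candidates) - 1)]


def toy_verify(problem: str, answer: str) -> bool:
    a, b = problem.split(" + ")
    return int(a) + int(b) == int(answer)


final, traj = solve_with_self_check("17 + 25", toy_propose, toy_verify)
print(final, traj)
print(outcome_reward(final, "42"), process_rewards(traj, "42"))
```

The outcome-level function scores only where the trajectory ends, while the process-level function scores every intermediate verdict, so a correct rejection of a wrong answer is rewarded even before the final answer is reached.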
