S^2R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning
February 18, 2025
Authors: Ruotian Ma, Peisong Wang, Cheng Liu, Xingyan Liu, Jiaqi Chen, Bang Zhang, Xin Zhou, Nan Du, Jia Li
cs.AI
Abstract
Recent studies have demonstrated the effectiveness of LLM test-time scaling.
However, existing approaches to incentivize LLMs' deep thinking abilities
generally require large-scale data or significant training efforts. Meanwhile,
it remains unclear how to improve the thinking abilities of less powerful base
models. In this work, we introduce S^2R, an efficient framework that enhances
LLM reasoning by teaching models to self-verify and self-correct during
inference. Specifically, we first initialize LLMs with iterative
self-verification and self-correction behaviors through supervised fine-tuning
on carefully curated data. The self-verification and self-correction skills are
then further strengthened by both outcome-level and process-level reinforcement
learning, with minimized resource requirements, enabling the model to
adaptively refine its reasoning process during inference. Our results
demonstrate that, with only 3.1k self-verifying and self-correcting behavior
initialization samples, Qwen2.5-math-7B achieves an accuracy improvement from
51.0% to 81.6%, outperforming models trained on an equivalent amount of
long-CoT distilled data. Extensive experiments and analysis based on three base
models across both in-domain and out-of-domain benchmarks validate the
effectiveness of S^2R. Our code and data are available at
https://github.com/NineAbyss/S2R.
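
The iterative self-verification and self-correction behavior described in the abstract can be pictured as a loop that alternates between attempting an answer and checking it. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the names `solve_with_self_check`, `toy_propose`, and `toy_verify` are hypothetical, the proposer and verifier are toy arithmetic stand-ins, and in S^2R a single trained model would presumably produce both the attempts and the verifications within one response.

```python
from typing import Callable, List, Tuple


def solve_with_self_check(
    problem: str,
    propose: Callable[[str, List[str]], str],
    verify: Callable[[str, str], bool],
    max_rounds: int = 4,
) -> Tuple[str, List[str]]:
    """Alternate between proposing an answer and verifying it.

    Returns the final answer and every attempt made, in order.
    (Hypothetical helper for illustration; not from the S^2R codebase.)
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    attempts: List[str] = []
    for _ in range(max_rounds):
        # Attempt, or re-attempt with knowledge of earlier rejected answers.
        answer = propose(problem, attempts)
        attempts.append(answer)
        # Self-verification: stop as soon as an answer is accepted.
        if verify(problem, answer):
            break
    return attempts[-1], attempts


def toy_propose(problem: str, previous: List[str]) -> str:
    # Toy stand-in: the first try is deliberately off by 10,
    # later tries compute the sum correctly.
    a, b = (int(x) for x in problem.split("+"))
    return str(a + b - 10) if not previous else str(a + b)


def toy_verify(problem: str, answer: str) -> bool:
    # Toy stand-in: check the answer with the inverse operation.
    a, b = (int(x) for x in problem.split("+"))
    return int(answer) - b == a


final, attempts = solve_with_self_check("17 + 25", toy_propose, toy_verify)
print(final, attempts)
```

Here the first attempt is rejected by the verifier and the second is accepted, so the loop stops after two rounds. How many rounds to allow, and how the reinforcement learning stages reward accurate verification and correction, are design details the abstract does not specify.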