S^2R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning

Abstract

Recent studies have demonstrated the effectiveness of LLM test-time scaling. However, existing approaches to incentivize LLMs' deep thinking abilities generally require large-scale data or significant training efforts. Meanwhile, it remains unclear how to improve the thinking abilities of less powerful base models. In this work, we introduce S^2R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Specifically, we first initialize LLMs with iterative self-verification and self-correction behaviors through supervised fine-tuning on carefully curated data. The self-verification and self-correction skills are then further strengthened by both outcome-level and process-level reinforcement learning, with minimized resource requirements, enabling the model to adaptively refine its reasoning process during inference. Our results demonstrate that, with only 3.1k self-verifying and self-correcting behavior initialization samples, Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data. Extensive experiments and analysis based on three base models across both in-domain and out-of-domain benchmarks validate the effectiveness of S^2R. Our code and data are available at https://github.com/NineAbyss/S2R.
