Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning

Abstract

Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in large reasoning models. To analyze reasoning dynamics, we use synthetic logic puzzles as training data because their complexity is controllable and their answers are straightforward to verify. We make several key technical contributions that lead to effective and stable RL training: a system prompt that emphasizes the thinking and answering process, a stringent format reward function that penalizes outputs for taking shortcuts, and a straightforward training recipe that achieves stable convergence. Our 7B model develops advanced reasoning skills, such as reflection, verification, and summarization, that are absent from the logic corpus. Remarkably, after training on just 5K logic problems, it generalizes to the challenging math benchmarks AIME and AMC.
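
The abstract mentions a stringent format reward that penalizes shortcut outputs but does not spell out the rule. The following is a minimal sketch of what such a rule-based check might look like, not the authors' implementation. The tag names (<think>, <answer>), the reward values (+1.0 and -1.0), and the function name format_reward are all assumptions made for illustration.

```python
import re

# Assumed layout: one reasoning section followed by one answer section.
# The abstract does not specify the exact tags; these are hypothetical.
_PATTERN = re.compile(
    r"\A\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*\Z",
    re.DOTALL,
)
_TAGS = ("<think>", "</think>", "<answer>", "</answer>")


def format_reward(response: str) -> float:
    """Return +1.0 for a well-formed response and -1.0 otherwise.

    The reward magnitudes are assumed values, not taken from the paper.
    """
    # Each tag must appear exactly once, so skipped or repeated sections fail.
    if any(response.count(tag) != 1 for tag in _TAGS):
        return -1.0
    match = _PATTERN.match(response)
    if match is None:
        # Wrong order, or stray text outside the two sections.
        return -1.0
    thinking, answer = match.groups()
    # Whitespace-only reasoning or answer is treated as a shortcut.
    if not thinking.strip() or not answer.strip():
        return -1.0
    return 1.0
```

Under these assumptions, a response that jumps straight to an answer without a reasoning section receives the penalty, while a response with both sections in order receives the positive reward. A correctness reward on the extracted answer would be a separate rule and is not shown here.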
