

Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning

February 20, 2025
Authors: Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, Chong Luo
cs.AI

Abstract

Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in large reasoning models. To analyze reasoning dynamics, we use synthetic logic puzzles as training data due to their controllable complexity and straightforward answer verification. We make several key technical contributions that lead to effective and stable RL training: a system prompt that emphasizes the thinking and answering process, a stringent format reward function that penalizes outputs that take shortcuts, and a straightforward training recipe that achieves stable convergence. Our 7B model develops advanced reasoning skills, such as reflection, verification, and summarization, that are absent from the logic corpus. Remarkably, after training on just 5K logic problems, it demonstrates generalization abilities to the challenging math benchmarks AIME and AMC.
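To make the "stringent format reward" concrete, the sketch below shows one plausible shape for such a rule-based reward in Python. It assumes the DeepSeek-R1-style `<think>`/`<answer>` tag convention mentioned in the paper; the function name, regex, and reward constants are illustrative assumptions, not the authors' exact implementation.

```python
import re

# Hypothetical rule-based reward: a strict format check that penalizes
# completions skipping the required <think>/<answer> structure, plus
# exact-match verification of the final answer. Reward values below are
# illustrative placeholders, not the paper's exact constants.
TEMPLATE = re.compile(
    r"\A<think>.+?</think>\s*<answer>(?P<answer>.+?)</answer>\s*\Z",
    re.DOTALL,
)

def rule_based_reward(completion: str, gold_answer: str) -> float:
    """Score one model completion against the verified gold answer."""
    match = TEMPLATE.match(completion.strip())
    if match is None:
        # Format violation: missing, reordered, or duplicated tags,
        # i.e. the model tried to shortcut the thinking process.
        return -1.0
    predicted = match.group("answer").strip().lower()
    if predicted == gold_answer.strip().lower():
        return 1.0   # well-formatted and correct
    return -0.5      # well-formatted but wrong answer
```

Because logic puzzles have a single verifiable solution, a reward like this needs no learned reward model: for example, `rule_based_reward("<think>...</think><answer>Knight</answer>", "knight")` would return 1.0, while an answer emitted without a `<think>` block would be penalized regardless of correctness.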
