
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

April 15, 2025
作者: Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, Wanjun Zhong
cs.AI

Abstract

While reasoning models (e.g., DeepSeek R1) trained with reinforcement learning (RL) excel in textual reasoning, they struggle in scenarios requiring structured problem-solving, such as geometric reasoning, concise computation, or complex equation solving, areas where computational tools like code interpreters (CI) demonstrate distinct advantages. To bridge this gap, we propose ReTool, which enhances long-form reasoning with tool-integrated learning, including two key features: (1) dynamic interleaving of real-time code execution within natural language reasoning processes, and (2) an automated RL paradigm that allows policy rollouts with multi-turn real-time code execution and teaches the model when and how to invoke tools based on outcome feedback. ReTool employs a systematic training framework, beginning with synthetic cold-start data generation to produce code-augmented long-form reasoning traces for fine-tuning base models. Subsequent RL training leverages task outcomes as rewards to iteratively refine the model's tool-use strategy, enabling autonomous discovery of optimal tool invocation patterns without human priors. Experiments on the challenging math Olympiad benchmark AIME demonstrate ReTool's superiority: our 32B model achieves 67% accuracy with 400 training steps, outperforming the text-based RL baseline (40% accuracy, 1080 steps) in both efficiency and performance. Remarkably, ReTool-32B attains 72.5% accuracy in extended settings, surpassing OpenAI's o1-preview by 27.9%. Further analysis reveals emergent behaviors such as code self-correction, signaling an "aha moment" in which the model autonomously masters adaptive tool use. These findings highlight the promise of outcome-driven tool integration for advancing complex mathematical reasoning and offer new insights into hybrid neuro-symbolic systems.
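To make the abstract's multi-turn rollout concrete, here is a minimal Python sketch of interleaved generation and code execution plus a sparse outcome reward. The `policy.generate` and `sandbox.run` interfaces, the `<code>`/`<interpreter>` tags, the `\boxed{}` answer convention, and the reward values are all illustrative assumptions, not APIs or formats specified by the paper.

```python
import re


def rollout_with_interpreter(policy, sandbox, prompt, max_turns=8):
    # Multi-turn rollout: alternate free-form generation with real-time
    # code execution. `policy` (the LLM) and `sandbox` (the code
    # interpreter) are hypothetical stand-ins for ReTool's components.
    trace = prompt
    for _ in range(max_turns):
        # Generate until the model closes a code block or stops on its own.
        segment = policy.generate(trace, stop=["</code>"])
        trace += segment
        match = re.search(r"<code>(.*)$", segment, re.DOTALL)
        if match is None:
            break  # no tool call this turn: the final answer was produced
        # Execute the emitted code and splice the interpreter output back
        # into the context so the next turn can condition on it.
        result = sandbox.run(match.group(1))
        trace += f"</code>\n<interpreter>{result}</interpreter>\n"
    return trace


def extract_final_answer(trace):
    # Toy parser: take the last \boxed{...} expression, a common math
    # answer convention (an assumption, not stated in the abstract).
    matches = re.findall(r"\\boxed\{([^}]*)\}", trace)
    return matches[-1] if matches else None


def outcome_reward(trace, gold_answer):
    # Sparse outcome-based reward: only final-answer correctness is
    # scored, with no human priors on when or how tools were used.
    return 1.0 if extract_final_answer(trace) == gold_answer else -1.0
```

In the RL stage these rewards would drive a policy-gradient update over whole rollouts, which is what lets the model discover tool-invocation patterns (such as the code self-correction behavior the abstract reports) purely from outcome feedback.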
