START: Self-taught Reasoner with Tools
March 6, 2025
Authors: Chengpeng Li, Mingfeng Xue, Zhenru Zhang, Jiaxi Yang, Beichen Zhang, Xiang Wang, Bowen Yu, Binyuan Hui, Junyang Lin, Dayiheng Liu
cs.AI
Abstract
Large reasoning models (LRMs) like OpenAI-o1 and DeepSeek-R1 have
demonstrated remarkable capabilities in complex reasoning tasks through the
utilization of long Chain-of-thought (CoT). However, these models often suffer
from hallucinations and inefficiencies due to their reliance solely on internal
reasoning processes. In this paper, we introduce START (Self-Taught Reasoner
with Tools), a novel tool-integrated long CoT reasoning LLM that significantly
enhances reasoning capabilities by leveraging external tools. Through code
execution, START is capable of performing complex computations, self-checking,
exploring diverse methods, and self-debugging, thereby addressing the
limitations of LRMs. The core innovation of START lies in its self-learning
framework, which comprises two key techniques: 1) Hint-infer: We demonstrate
that inserting artificially designed hints (e.g., "Wait, maybe using Python
here is a good idea.") during the inference process of an LRM effectively
stimulates its ability to utilize external tools without the need for any
demonstration data. Hint-infer can also serve as a simple and effective
sequential test-time scaling method; 2) Hint Rejection Sampling Fine-Tuning
(Hint-RFT): Hint-RFT combines Hint-infer and RFT by scoring, filtering, and
modifying the reasoning trajectories with tool invocation generated by an LRM
via Hint-infer, followed by fine-tuning the LRM. Through this framework, we
have fine-tuned the QwQ-32B model to achieve START. On PhD-level science QA
(GPQA), competition-level math benchmarks (AMC23, AIME24, AIME25), and the
competition-level code benchmark (LiveCodeBench), START achieves accuracy rates
of 63.6%, 95.0%, 66.7%, 47.1%, and 47.3%, respectively. It significantly
outperforms the base QwQ-32B and achieves performance comparable to the
state-of-the-art open-weight model R1-Distill-Qwen-32B and the proprietary
model o1-Preview.
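
To make Hint-infer concrete, below is a minimal Python sketch of hint injection at inference time. It is an illustration under assumptions, not the authors' released code: the model.generate interface, the </think> stop marker, and the fixed round count are hypothetical stand-ins for the actual LRM decoding stack.

    # Minimal sketch of Hint-infer (hypothetical interface, not the paper's code).
    HINT = "Wait, maybe using Python here is a good idea."

    def hint_infer(model, prompt, max_rounds=3):
        # Generate a reasoning trace, injecting the designed hint each time
        # the model is about to end its thinking phase. `model.generate(text,
        # stop=...)` is an assumed completion API returning text up to `stop`.
        trace = prompt
        for _ in range(max_rounds):
            trace += model.generate(trace, stop="</think>")
            # Inject the hand-written hint instead of letting reasoning end;
            # in practice this elicits a Python code block (a tool call) whose
            # execution output would be appended before decoding resumes.
            trace += "\n" + HINT + "\n"
        # Final unhinted continuation so the model can finish its answer.
        trace += model.generate(trace)
        return trace

Because each injected hint buys the model another round of reasoning, the same mechanism doubles as the sequential test-time scaling method mentioned above: more rounds mean longer thinking, at the cost of more decoded tokens.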
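A companion sketch of the Hint-RFT data-construction step, under the same caveats: hint_infer is the function above, and the tool-use and correctness checks are crude hypothetical proxies for the paper's scoring, filtering, and trajectory-modification pipeline.

    # Hypothetical sketch of Hint-RFT: score and filter hint-induced
    # trajectories, then fine-tune the LRM on the kept traces.

    def build_hint_rft_dataset(model, problems):
        # Collect tool-augmented traces via Hint-infer and keep the good ones.
        dataset = []
        for problem in problems:
            trace = hint_infer(model, problem["prompt"])  # sketch above
            uses_tool = "```python" in trace              # did it invoke the tool?
            correct = problem["answer"] in trace          # crude answer check
            if uses_tool and correct:                     # rejection sampling
                dataset.append({"prompt": problem["prompt"],
                                "response": trace})
        return dataset

The filtered dataset would then be used for supervised fine-tuning of QwQ-32B with any standard SFT trainer to obtain START.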