OTC: Optimal Tool Calls via Reinforcement Learning
April 21, 2025
Authors: Hongru Wang, Cheng Qian, Wanjun Zhong, Xiusi Chen, Jiahao Qiu, Shijue Huang, Bowen Jin, Mengdi Wang, Kam-Fai Wong, Heng Ji
cs.AI
Abstract
Tool-integrated reasoning (TIR) augments large language models (LLMs) with the ability to invoke external tools, such as search engines and code interpreters, to solve tasks beyond the capabilities of language-only reasoning. While reinforcement learning (RL) has shown promise in improving TIR by optimizing final answer correctness, existing approaches often overlook the efficiency and cost associated with tool usage. This can lead to suboptimal behavior, including excessive tool calls that increase computational and financial overhead, or insufficient tool use that compromises answer quality. In this work, we propose Optimal Tool Call-controlled Policy Optimization (OTC-PO), a simple yet effective RL-based framework that encourages models to produce accurate answers with minimal tool calls. Our method introduces a tool-integrated reward that jointly considers correctness and tool efficiency, promoting high tool productivity. We instantiate this framework within both Proximal Policy Optimization (PPO) and Group Relative Preference Optimization (GRPO), resulting in OTC-PPO and OTC-GRPO. Experiments with Qwen-2.5 and Qwen-Math across multiple QA benchmarks show that our approach reduces tool calls by up to 73.1% and improves tool productivity by up to 229.4%, while maintaining comparable answer accuracy. To the best of our knowledge, this is the first RL-based framework that explicitly optimizes tool-use efficiency in TIR.
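The abstract does not specify the exact form of the tool-integrated reward or of the tool-productivity metric. The Python sketch below is only an illustration of the general idea of scaling answer correctness by a tool-efficiency coefficient: the function names, the `optimal_calls` and `max_calls` parameters, and the cosine-shaped coefficient are assumptions for exposition, not the paper's actual formulation, and tool productivity is approximated here as correct answers per tool call.

```python
import math


def tool_integrated_reward(correct: bool, tool_calls: int,
                           optimal_calls: int, max_calls: int) -> float:
    """Illustrative tool-integrated reward: correctness scaled by a
    tool-efficiency coefficient. The specific shape is an assumption;
    the abstract only states that correctness and tool efficiency are
    considered jointly."""
    if not correct:
        return 0.0  # no reward without a correct final answer
    # Efficiency coefficient in [0, 1]: highest when the trajectory uses
    # the (assumed) optimal number of tool calls, decaying smoothly as
    # tool usage deviates from that optimum.
    deviation = min(abs(tool_calls - optimal_calls) / max(max_calls, 1), 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * deviation))


def tool_productivity(num_correct: int, total_tool_calls: int) -> float:
    """One plausible reading of 'tool productivity': correct answers
    obtained per tool call issued."""
    return num_correct / max(total_tool_calls, 1)
```

Under this reading, OTC-PPO and OTC-GRPO would substitute such a scalar for the correctness-only outcome reward in standard PPO or GRPO training, so trajectories that reach a correct answer with fewer tool calls receive higher returns.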