OTC: Optimal Tool Calls via Reinforcement Learning
April 21, 2025
Authors: Hongru Wang, Cheng Qian, Wanjun Zhong, Xiusi Chen, Jiahao Qiu, Shijue Huang, Bowen Jin, Mengdi Wang, Kam-Fai Wong, Heng Ji
cs.AI
Abstract
Tool-integrated reasoning (TIR) augments large language models (LLMs) with
the ability to invoke external tools, such as search engines and code
interpreters, to solve tasks beyond the capabilities of language-only
reasoning. While reinforcement learning (RL) has shown promise in improving TIR
by optimizing final answer correctness, existing approaches often overlook the
efficiency and cost associated with tool usage. This can lead to suboptimal
behavior, including excessive tool calls that increase computational and
financial overhead, or insufficient tool use that compromises answer quality.
In this work, we propose Optimal Tool Call-controlled Policy Optimization
(OTC-PO), a simple yet effective RL-based framework that encourages models to
produce accurate answers with minimal tool calls. Our method introduces a
tool-integrated reward that jointly considers correctness and tool efficiency,
promoting high tool productivity. We instantiate this framework within both
Proximal Policy Optimization (PPO) and Group Relative Policy Optimization
(GRPO), resulting in OTC-PPO and OTC-GRPO. Experiments with Qwen-2.5 and
Qwen-Math across multiple QA benchmarks show that our approach reduces tool
calls by up to 73.1% and improves tool productivity by up to 229.4%, while
maintaining comparable answer accuracy. To the best of our knowledge, this is
the first RL-based framework that explicitly optimizes tool-use efficiency in
TIR.
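The abstract describes a tool-integrated reward that jointly scores answer correctness and tool efficiency, but it does not spell out the reward's form. The sketch below is a minimal illustration only, not the paper's actual formula: the multiplicative coupling, the 1/(1 + excess-calls) decay, and the `min_calls` parameter are all assumptions, and `tool_productivity` reflects one plausible reading of the reported metric (correct answers per tool call).

```python
# Hypothetical sketch of a tool-integrated reward in the spirit of OTC-PO.
# The scaling scheme below is an illustrative assumption, not the paper's formula.

def tool_integrated_reward(correct: bool, tool_calls: int, min_calls: int = 0) -> float:
    """Reward correct answers, discounted for tool calls beyond a presumed minimum."""
    correctness = 1.0 if correct else 0.0
    excess = max(tool_calls - min_calls, 0)
    efficiency = 1.0 / (1.0 + excess)  # decays as extra tool calls accumulate
    return correctness * efficiency    # multiplicative coupling is an assumption


def tool_productivity(num_correct: int, total_tool_calls: int) -> float:
    """One plausible reading of 'tool productivity': correct answers per tool call."""
    return num_correct / max(total_tool_calls, 1)


if __name__ == "__main__":
    # A trajectory that answers correctly with 3 calls when 1 suffices
    # earns less than one that answers correctly with the minimal single call.
    print(tool_integrated_reward(correct=True, tool_calls=3, min_calls=1))  # ~0.333
    print(tool_integrated_reward(correct=True, tool_calls=1, min_calls=1))  # 1.0
```

Under this kind of shaping, two rollouts that both reach the correct answer are ranked by how few tool calls they made, which is the behavior the abstract attributes to OTC-PPO and OTC-GRPO.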