ToolRL: Reward is All Tool Learning Needs
April 16, 2025
Authors: Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, Heng Ji
cs.AI
Abstract
Current Large Language Models (LLMs) often undergo supervised fine-tuning (SFT) to acquire tool use capabilities. However, SFT struggles to generalize to unfamiliar or complex tool use scenarios. Recent advancements in reinforcement learning (RL), particularly with R1-like models, have demonstrated promising reasoning and generalization abilities. Yet, reward design for tool use presents unique challenges: multiple tools may be invoked with diverse parameters, and coarse-grained reward signals, such as answer matching, fail to offer the fine-grained feedback required for effective learning. In this work, we present the first comprehensive study on reward design for tool selection and application tasks within the RL paradigm. We systematically explore a wide range of reward strategies, analyzing their types, scales, granularity, and temporal dynamics. Building on these insights, we propose a principled reward design tailored for tool use tasks and apply it to train LLMs using Group Relative Policy Optimization (GRPO). Empirical evaluations across diverse benchmarks demonstrate that our approach yields robust, scalable, and stable training, achieving a 17% improvement over base models and a 15% gain over SFT models. These results highlight the critical role of thoughtful reward design in enhancing the tool use capabilities and generalization performance of LLMs. All the code is released to facilitate future research.
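To make the abstract's two key ideas concrete, here is a minimal, hypothetical Python sketch: a decomposed tool-use reward (format, tool-name match, parameter match) that gives finer feedback than binary answer matching, plus GRPO-style group-normalized advantages. The helper names, weights, and exact decomposition below are illustrative assumptions, not the paper's actual reward scheme.

```python
# Illustrative sketch only: the reward decomposition and weights here are
# assumptions for exposition, not the exact design proposed in the paper.
from dataclasses import dataclass, field
from typing import Dict, List
import statistics


@dataclass
class ToolCall:
    name: str
    params: Dict[str, str] = field(default_factory=dict)


def tool_use_reward(pred: List[ToolCall], gold: List[ToolCall],
                    format_ok: bool) -> float:
    """Fine-grained reward: format correctness + tool-name match + parameter match."""
    r_format = 1.0 if format_ok else 0.0
    gold_names = {g.name for g in gold}
    pred_names = {p.name for p in pred}
    r_name = len(gold_names & pred_names) / max(len(gold_names), 1)
    # Parameter-level credit: fraction of gold (tool, key, value) triples recovered.
    gold_params = {(g.name, k, v) for g in gold for k, v in g.params.items()}
    pred_params = {(p.name, k, v) for p in pred for k, v in p.params.items()}
    r_param = len(gold_params & pred_params) / max(len(gold_params), 1)
    return r_format + r_name + r_param  # finer signal than binary answer matching


def grpo_advantages(rewards: List[float]) -> List[float]:
    """Group-relative advantages: normalize each sampled completion's reward
    by the mean and std of its sampling group (the core idea behind GRPO)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]


# Example: four sampled completions for one prompt, scored and normalized.
gold = [ToolCall("search", {"query": "weather Paris"})]
preds = [
    [ToolCall("search", {"query": "weather Paris"})],  # fully correct
    [ToolCall("search", {"query": "Paris"})],          # right tool, wrong parameter
    [ToolCall("calculator", {"expr": "1+1"})],         # wrong tool
    [],                                                 # no call at all
]
rewards = [tool_use_reward(p, gold, format_ok=bool(p)) for p in preds]
print(rewards, grpo_advantages(rewards))
```

The decomposed reward separates completions that a single answer-matching signal would score identically (e.g., "right tool, wrong parameter" vs. "wrong tool"), which is the kind of fine-grained feedback the abstract argues tool learning needs.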