
Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models

March 3, 2025
Authors: Zhengliang Shi, Yuhan Wang, Lingyong Yan, Pengjie Ren, Shuaiqiang Wang, Dawei Yin, Zhaochun Ren
cs.AI

Abstract

Tool learning aims to augment large language models (LLMs) with diverse tools, enabling them to act as agents that solve practical tasks. Because tool-using LLMs have limited context length, adopting information retrieval (IR) models to select useful tools from large toolsets is a critical first step. However, the performance of IR models on tool retrieval tasks remains underexplored and unclear. Most tool-use benchmarks simplify this step by manually pre-annotating a small set of relevant tools for each task, which is far from real-world scenarios. In this paper, we propose ToolRet, a heterogeneous tool retrieval benchmark comprising 7.6k diverse retrieval tasks and a corpus of 43k tools collected from existing datasets. We benchmark six types of models on ToolRet. Surprisingly, even models with strong performance on conventional IR benchmarks exhibit poor performance on ToolRet. This low retrieval quality degrades the task pass rate of tool-using LLMs. As a further step, we contribute a large-scale training dataset with over 200k instances, which substantially improves the tool retrieval ability of IR models.
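As a minimal illustration of the retrieval step the abstract describes, the sketch below ranks a toy tool corpus against a user query using TF-IDF cosine similarity, with only the Python standard library. The tool names and descriptions are hypothetical, and this simple lexical baseline merely stands in for the stronger IR models benchmarked in the paper.

```python
import math
from collections import Counter

# Hypothetical toy tool corpus; the real ToolRet corpus holds ~43k tools.
TOOLS = {
    "get_weather": "Fetch the current weather forecast for a given city.",
    "send_email": "Send an email message to a recipient address.",
    "search_flights": "Search available flights between two airports on a date.",
    "translate_text": "Translate text from one language to another.",
}

def tokenize(text):
    return [t.strip(".,").lower() for t in text.split()]

def retrieve(query, tools, k=2):
    """Rank tools by TF-IDF cosine similarity to the query.

    A lexical baseline only; dense retrievers would embed the query and
    tool documents instead of matching surface tokens.
    """
    docs = {name: tokenize(desc) for name, desc in tools.items()}
    n = len(docs)
    # Document frequency over the tool corpus, then smoothed-free IDF.
    df = Counter(term for toks in docs.values() for term in set(toks))
    idf = {term: math.log(n / count) for term, count in df.items()}

    def vec(tokens):
        tf = Counter(tokens)
        return {t: tf[t] * idf.get(t, 0.0) for t in tf}

    def cosine(a, b):
        dot = sum(a[t] * b.get(t, 0.0) for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    q = vec(tokenize(query))
    ranked = sorted(docs, key=lambda name: cosine(q, vec(docs[name])),
                    reverse=True)
    return ranked[:k]

# "get_weather" ranks first for a weather-related query.
print(retrieve("what is the weather in Paris today", TOOLS))
```

The retrieved top-k tool descriptions would then be placed in the LLM's prompt, which is exactly where retrieval quality determines the downstream task pass rate the paper measures.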

