ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use
January 5, 2025
Authors: Junjie Ye, Zhengyin Du, Xuesong Yao, Weijian Lin, Yufei Xu, Zehui Chen, Zaiyuan Wang, Sining Zhu, Zhiheng Xi, Siyu Yuan, Tao Gui, Qi Zhang, Xuanjing Huang, Jiechao Chen
cs.AI
Abstract
Effective evaluation of multi-hop tool use is critical for analyzing the
understanding, reasoning, and function-calling capabilities of large language
models (LLMs). However, progress has been hindered by a lack of reliable
evaluation datasets. To address this, we present ToolHop, a dataset comprising
995 user queries and 3,912 associated tools, specifically designed for rigorous
evaluation of multi-hop tool use. ToolHop ensures diverse queries, meaningful
interdependencies, locally executable tools, detailed feedback, and verifiable
answers through a novel query-driven data construction approach that includes
tool creation, document refinement, and code generation. We evaluate 14 LLMs
across five model families (i.e., LLaMA3.1, Qwen2.5, Gemini1.5, Claude3.5, and
GPT), uncovering significant challenges in handling multi-hop tool-use
scenarios. The leading model, GPT-4o, achieves an accuracy of 49.04%,
underscoring substantial room for improvement. Further analysis reveals
variations in tool-use strategies across model families, offering actionable
insights to guide the development of more effective approaches. Code and data
are available at https://huggingface.co/bytedance-research/ToolHop.
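To make the notion of multi-hop tool use concrete, below is a minimal, self-contained Python sketch of an evaluation episode. The tool names (`search_birth_year`, `compute_age`), the query, and the gold answer are hypothetical illustrations, not items from the ToolHop dataset; the sketch only shows the chained dependency the abstract describes, where the output of one locally executable tool feeds the next call and the final result is checked against a verifiable answer.

```python
# Minimal sketch of a multi-hop tool-use evaluation loop.
# Tool names, query, and gold answer are hypothetical stand-ins,
# not actual ToolHop data.

def search_birth_year(person: str) -> int:
    """Hop 1: a locally executable tool that returns structured output."""
    known = {"Alan Turing": 1912}
    return known[person]  # raises KeyError on a bad call (detailed feedback)

def compute_age(birth_year: int, in_year: int) -> int:
    """Hop 2: depends on the result of hop 1."""
    return in_year - birth_year

TOOLS = {"search_birth_year": search_birth_year, "compute_age": compute_age}

def run_episode(tool_calls, gold_answer):
    """Execute a proposed call chain and verify the final answer.

    `tool_calls` is a list of (tool_name, kwargs) pairs; a kwarg value of
    the sentinel "$PREV" is replaced by the previous tool's output, which
    mimics the interdependency between hops.
    """
    prev = None
    for name, kwargs in tool_calls:
        resolved = {k: (prev if v == "$PREV" else v) for k, v in kwargs.items()}
        prev = TOOLS[name](**resolved)
    return prev == gold_answer  # verifiable answer

# Query: "How old was Alan Turing in 1950?"  Gold answer: 38.
calls = [
    ("search_birth_year", {"person": "Alan Turing"}),
    ("compute_age", {"birth_year": "$PREV", "in_year": 1950}),
]
print(run_episode(calls, gold_answer=38))  # True
```

Because each tool runs locally and deterministically, a model's entire call chain can be scored automatically: any malformed call fails loudly at the offending hop, and the final answer is compared directly against the gold label.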