ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use
January 5, 2025
Authors: Junjie Ye, Zhengyin Du, Xuesong Yao, Weijian Lin, Yufei Xu, Zehui Chen, Zaiyuan Wang, Sining Zhu, Zhiheng Xi, Siyu Yuan, Tao Gui, Qi Zhang, Xuanjing Huang, Jiechao Chen
cs.AI
Abstract
Effective evaluation of multi-hop tool use is critical for analyzing the
understanding, reasoning, and function-calling capabilities of large language
models (LLMs). However, progress has been hindered by a lack of reliable
evaluation datasets. To address this, we present ToolHop, a dataset comprising
995 user queries and 3,912 associated tools, specifically designed for rigorous
evaluation of multi-hop tool use. ToolHop ensures diverse queries, meaningful
interdependencies, locally executable tools, detailed feedback, and verifiable
answers through a novel query-driven data construction approach that includes
tool creation, document refinement, and code generation. We evaluate 14 LLMs
across five model families (i.e., LLaMA3.1, Qwen2.5, Gemini1.5, Claude3.5, and
GPT), uncovering significant challenges in handling multi-hop tool-use
scenarios. The leading model, GPT-4o, achieves an accuracy of 49.04%,
underscoring substantial room for improvement. Further analysis reveals
variations in tool-use strategies across model families, offering actionable
insights to guide the development of more effective approaches. Code and data
are available at https://huggingface.co/bytedance-research/ToolHop.
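To make the notion of multi-hop tool use concrete, below is a minimal, self-contained Python sketch of an evaluation episode. The tool names (`search_birth_year`, `compute_age`), the query, and the gold answer are hypothetical illustrations, not items from the ToolHop dataset; the sketch only shows the chained dependency the abstract describes, where the output of one locally executable tool feeds the next call and the final result is checked against a verifiable answer.

```python
# Minimal sketch of a multi-hop tool-use evaluation loop.
# Tool names, query, and gold answer are hypothetical stand-ins,
# not actual ToolHop data.

def search_birth_year(person: str) -> int:
    """Hop 1: a locally executable tool that returns structured output."""
    known = {"Alan Turing": 1912}
    return known[person]  # raises KeyError on a bad call (detailed feedback)

def compute_age(birth_year: int, in_year: int) -> int:
    """Hop 2: depends on the result of hop 1."""
    return in_year - birth_year

TOOLS = {"search_birth_year": search_birth_year, "compute_age": compute_age}

def run_episode(tool_calls, gold_answer):
    """Execute a proposed call chain and verify the final answer.

    `tool_calls` is a list of (tool_name, kwargs) pairs; a kwarg value of
    the sentinel "$PREV" is replaced by the previous tool's output, which
    mimics the interdependency between hops.
    """
    prev = None
    for name, kwargs in tool_calls:
        resolved = {k: (prev if v == "$PREV" else v) for k, v in kwargs.items()}
        prev = TOOLS[name](**resolved)
    return prev == gold_answer  # verifiable answer

# Query: "How old was Alan Turing in 1950?"  Gold answer: 38.
calls = [
    ("search_birth_year", {"person": "Alan Turing"}),
    ("compute_age", {"birth_year": "$PREV", "in_year": 1950}),
]
print(run_episode(calls, gold_answer=38))  # True
```

Because each tool runs locally and deterministically, a model's entire call chain can be scored automatically: any malformed call fails loudly at the offending hop, and the final answer is compared directly against the gold label.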