ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use
January 5, 2025
Authors: Junjie Ye, Zhengyin Du, Xuesong Yao, Weijian Lin, Yufei Xu, Zehui Chen, Zaiyuan Wang, Sining Zhu, Zhiheng Xi, Siyu Yuan, Tao Gui, Qi Zhang, Xuanjing Huang, Jiechao Chen
cs.AI
Abstract
Effective evaluation of multi-hop tool use is critical for analyzing the
understanding, reasoning, and function-calling capabilities of large language
models (LLMs). However, progress has been hindered by a lack of reliable
evaluation datasets. To address this, we present ToolHop, a dataset comprising
995 user queries and 3,912 associated tools, specifically designed for rigorous
evaluation of multi-hop tool use. ToolHop ensures diverse queries, meaningful
interdependencies, locally executable tools, detailed feedback, and verifiable
answers through a novel query-driven data construction approach that includes
tool creation, document refinement, and code generation. We evaluate 14 LLMs
across five model families (i.e., LLaMA3.1, Qwen2.5, Gemini1.5, Claude3.5, and
GPT), uncovering significant challenges in handling multi-hop tool-use
scenarios. The leading model, GPT-4o, achieves an accuracy of 49.04%,
underscoring substantial room for improvement. Further analysis reveals
variations in tool-use strategies across model families, offering actionable
insights to guide the development of more effective approaches. Code and data
can be found at https://huggingface.co/bytedance-research/ToolHop.
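The core difficulty the benchmark targets is that later tool calls depend on the outputs of earlier ones. A minimal sketch of such a multi-hop chain is below; the tools, lookup tables, and query are hypothetical toy examples, not drawn from the ToolHop dataset.

```python
# Toy illustration of a multi-hop tool-use chain: each tool's output feeds
# the next call, mirroring the interdependencies ToolHop is designed to test.
# The tools and query below are hypothetical, not taken from the dataset.

def find_author(book_title: str) -> str:
    """Hop 1: look up the author of a book (toy lookup table)."""
    return {"Norwegian Wood": "Haruki Murakami"}[book_title]

def find_birth_year(person: str) -> int:
    """Hop 2: look up a person's birth year (toy lookup table)."""
    return {"Haruki Murakami": 1949}[person]

def answer_query(book_title: str) -> int:
    """Resolve 'When was the author of <book> born?' via two dependent hops."""
    author = find_author(book_title)   # hop 1
    return find_birth_year(author)     # hop 2 depends on hop 1's output

print(answer_query("Norwegian Wood"))  # 1949
```

Because hop 2 cannot be issued until hop 1 returns, a model must plan the call order correctly; locally executable tools like these also provide the detailed feedback and verifiable answers the abstract describes.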