ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use
January 5, 2025
Authors: Junjie Ye, Zhengyin Du, Xuesong Yao, Weijian Lin, Yufei Xu, Zehui Chen, Zaiyuan Wang, Sining Zhu, Zhiheng Xi, Siyu Yuan, Tao Gui, Qi Zhang, Xuanjing Huang, Jiechao Chen
cs.AI
Abstract
Effective evaluation of multi-hop tool use is critical for analyzing the
understanding, reasoning, and function-calling capabilities of large language
models (LLMs). However, progress has been hindered by a lack of reliable
evaluation datasets. To address this, we present ToolHop, a dataset comprising
995 user queries and 3,912 associated tools, specifically designed for rigorous
evaluation of multi-hop tool use. ToolHop ensures diverse queries, meaningful
interdependencies, locally executable tools, detailed feedback, and verifiable
answers through a novel query-driven data construction approach that includes
tool creation, document refinement, and code generation. We evaluate 14 LLMs
across five model families (i.e., LLaMA3.1, Qwen2.5, Gemini1.5, Claude3.5, and
GPT), uncovering significant challenges in handling multi-hop tool-use
scenarios. The leading model, GPT-4o, achieves an accuracy of 49.04%,
underscoring substantial room for improvement. Further analysis reveals
variations in tool-use strategies across model families, offering actionable
insights to guide the development of more effective approaches. Code and data
can be found at https://huggingface.co/bytedance-research/ToolHop.
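The core difficulty the benchmark targets is that later tool calls depend on the outputs of earlier ones. A minimal sketch of such a multi-hop chain is below; the tools, lookup tables, and query are hypothetical toy examples, not drawn from the ToolHop dataset.

```python
# Toy illustration of a multi-hop tool-use chain: each tool's output feeds
# the next call, mirroring the interdependencies ToolHop is designed to test.
# The tools and query below are hypothetical, not taken from the dataset.

def find_author(book_title: str) -> str:
    """Hop 1: look up the author of a book (toy lookup table)."""
    return {"Norwegian Wood": "Haruki Murakami"}[book_title]

def find_birth_year(person: str) -> int:
    """Hop 2: look up a person's birth year (toy lookup table)."""
    return {"Haruki Murakami": 1949}[person]

def answer_query(book_title: str) -> int:
    """Resolve 'When was the author of <book> born?' via two dependent hops."""
    author = find_author(book_title)   # hop 1
    return find_birth_year(author)     # hop 2 depends on hop 1's output

print(answer_query("Norwegian Wood"))  # 1949
```

Because hop 2 cannot be issued until hop 1 returns, a model must plan the call order correctly; locally executable tools like these also provide the detailed feedback and verifiable answers the abstract describes.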