ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use

Abstract

Effective evaluation of multi-hop tool use is critical for analyzing the understanding, reasoning, and function-calling capabilities of large language models (LLMs). However, progress has been hindered by a lack of reliable evaluation datasets. To address this, we present ToolHop, a dataset comprising 995 user queries and 3,912 associated tools, specifically designed for rigorous evaluation of multi-hop tool use. ToolHop ensures diverse queries, meaningful interdependencies, locally executable tools, detailed feedback, and verifiable answers through a novel query-driven data construction approach that includes tool creation, document refinement, and code generation. We evaluate 14 LLMs across five model families (i.e., LLaMA3.1, Qwen2.5, Gemini1.5, Claude3.5, and GPT), uncovering significant challenges in handling multi-hop tool-use scenarios. The leading model, GPT-4o, achieves an accuracy of 49.04%, underscoring substantial room for improvement. Further analysis reveals differences in tool-use strategies across model families, offering actionable insights to guide the development of more effective approaches. Code and data are available at https://huggingface.co/bytedance-research/ToolHop.
