ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities
August 8, 2024
作者: Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Felix Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, Zirui Wang, Ruoming Pang
cs.AI
Abstract
Recent large language models (LLMs) advancements sparked a growing research
interest in tool assisted LLMs solving real-world challenges, which calls for
comprehensive evaluation of tool-use capabilities. While previous works focused
on either evaluating over stateless web services (RESTful API), based on a
single turn user prompt, or an off-policy dialog trajectory, ToolSandbox
includes stateful tool execution, implicit state dependencies between tools, a
built-in user simulator supporting on-policy conversational evaluation and a
dynamic evaluation strategy for intermediate and final milestones over an
arbitrary trajectory. We show that open source and proprietary models have a
significant performance gap, and that complex tasks like State Dependency,
Canonicalization and Insufficient Information defined in ToolSandbox
challenge even the most capable SOTA LLMs, providing brand-new insights into
tool-use LLM capabilities. The ToolSandbox evaluation framework is released at
https://github.com/apple/ToolSandbox.
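Two of the abstract's core ideas, stateful tool execution and implicit state dependencies between tools, can be illustrated with a minimal sketch. This is not the ToolSandbox API; all names below (`WORLD_STATE`, `set_cellular`, `send_message`, `milestone_reached`) are illustrative assumptions. The point is that a tool's precondition lives in shared world state rather than in its signature, and that a milestone is checked against the final state rather than against one fixed golden call sequence.

```python
# Minimal sketch (assumed names, not the ToolSandbox API) of stateful tools,
# an implicit state dependency, and a state-based milestone check.

WORLD_STATE = {"cellular_on": False, "messages": []}

def set_cellular(on: bool) -> None:
    """Stateful tool: mutates the shared world state."""
    WORLD_STATE["cellular_on"] = on

def send_message(recipient: str, text: str) -> str:
    """Implicitly depends on cellular service being on; the dependency
    is invisible in the tool signature and must be discovered."""
    if not WORLD_STATE["cellular_on"]:
        raise RuntimeError("cellular service is off")
    WORLD_STATE["messages"].append((recipient, text))
    return "sent"

def milestone_reached() -> bool:
    """Dynamic evaluation: judge the milestone from the resulting world
    state, so any trajectory reaching that state can pass."""
    return ("alice", "hello") in WORLD_STATE["messages"]
```

An agent that calls `send_message` directly fails, must recover by calling `set_cellular(True)`, and then succeeds; the milestone check accepts any such trajectory because it inspects state, not the call order.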