ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities

August 8, 2024
Authors: Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Felix Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, Zirui Wang, Ruoming Pang
cs.AI

Abstract

Recent advancements in large language models (LLMs) have sparked growing research interest in tool-assisted LLMs that solve real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. Previous works focused on evaluating either stateless web services (RESTful APIs) driven by a single-turn user prompt, or an off-policy dialog trajectory. In contrast, ToolSandbox includes stateful tool execution, implicit state dependencies between tools, a built-in user simulator supporting on-policy conversational evaluation, and a dynamic evaluation strategy for intermediate and final milestones over an arbitrary trajectory. We show that open-source and proprietary models have a significant performance gap, and that complex tasks defined in ToolSandbox, such as State Dependency, Canonicalization, and Insufficient Information, are challenging for even the most capable SOTA LLMs, providing brand-new insights into the tool-use capabilities of LLMs. The ToolSandbox evaluation framework is released at https://github.com/apple/ToolSandbox.
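To make the abstract's core ideas concrete, the following is a minimal sketch (in plain Python, with invented names, not the actual ToolSandbox API) of stateful tool execution with an implicit state dependency between tools, plus a milestone check over the resulting world state: a `send_message` tool silently depends on cellular service being enabled by a prior `set_cellular` call.

```python
# Hypothetical sketch of stateful tools with an implicit dependency,
# in the spirit of ToolSandbox. All names here are illustrative only.

class WorldState:
    """Shared mutable state that tool calls read and modify."""
    def __init__(self):
        self.cellular_on = False
        self.messages = []

def set_cellular(state: WorldState, on: bool) -> dict:
    """Tool: toggle cellular service (mutates the shared world state)."""
    state.cellular_on = on
    return {"cellular_on": on}

def send_message(state: WorldState, recipient: str, text: str) -> dict:
    """Tool: send an SMS; implicitly requires cellular to be on."""
    if not state.cellular_on:
        raise RuntimeError("Cellular service is off")
    state.messages.append({"to": recipient, "text": text})
    return {"sent": True}

def milestone_reached(state: WorldState, recipient: str, text: str) -> bool:
    """Evaluation: check a final milestone against the world state,
    regardless of which trajectory of tool calls produced it."""
    return any(m["to"] == recipient and text in m["text"]
               for m in state.messages)

if __name__ == "__main__":
    state = WorldState()
    set_cellular(state, True)   # the agent must satisfy the dependency first
    send_message(state, "Alice", "running late")
    print(milestone_reached(state, "Alice", "late"))
```

Because the dependency is implicit (nothing in `send_message`'s signature mentions cellular), an agent that calls the tools in the wrong order fails at execution time, which is exactly the kind of State Dependency behavior the benchmark measures.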
