ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities

August 8, 2024
Authors: Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Felix Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, Zirui Wang, Ruoming Pang
cs.AI

Abstract

Recent advancements in large language models (LLMs) have sparked growing research interest in tool-assisted LLMs that solve real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. Previous works focused on evaluating either stateless web services (RESTful APIs) driven by a single-turn user prompt, or an off-policy dialog trajectory. In contrast, ToolSandbox includes stateful tool execution, implicit state dependencies between tools, a built-in user simulator supporting on-policy conversational evaluation, and a dynamic evaluation strategy for intermediate and final milestones over an arbitrary trajectory. We show that open-source and proprietary models have a significant performance gap, and that complex tasks defined in ToolSandbox, such as State Dependency, Canonicalization, and Insufficient Information, are challenging for even the most capable SOTA LLMs, providing brand-new insights into the tool-use capabilities of LLMs. The ToolSandbox evaluation framework is released at https://github.com/apple/ToolSandbox.
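To make the abstract's core ideas concrete, the following is a minimal sketch (in plain Python, with invented names, not the actual ToolSandbox API) of stateful tool execution with an implicit state dependency between tools, plus a milestone check over the resulting world state: a `send_message` tool silently depends on cellular service being enabled by a prior `set_cellular` call.

```python
# Hypothetical sketch of stateful tools with an implicit dependency,
# in the spirit of ToolSandbox. All names here are illustrative only.

class WorldState:
    """Shared mutable state that tool calls read and modify."""
    def __init__(self):
        self.cellular_on = False
        self.messages = []

def set_cellular(state: WorldState, on: bool) -> dict:
    """Tool: toggle cellular service (mutates the shared world state)."""
    state.cellular_on = on
    return {"cellular_on": on}

def send_message(state: WorldState, recipient: str, text: str) -> dict:
    """Tool: send an SMS; implicitly requires cellular to be on."""
    if not state.cellular_on:
        raise RuntimeError("Cellular service is off")
    state.messages.append({"to": recipient, "text": text})
    return {"sent": True}

def milestone_reached(state: WorldState, recipient: str, text: str) -> bool:
    """Evaluation: check a final milestone against the world state,
    regardless of which trajectory of tool calls produced it."""
    return any(m["to"] == recipient and text in m["text"]
               for m in state.messages)

if __name__ == "__main__":
    state = WorldState()
    set_cellular(state, True)   # the agent must satisfy the dependency first
    send_message(state, "Alice", "running late")
    print(milestone_reached(state, "Alice", "late"))
```

Because the dependency is implicit (nothing in `send_message`'s signature mentions cellular), an agent that calls the tools in the wrong order fails at execution time, which is exactly the kind of State Dependency behavior the benchmark measures.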
