ComplexFuncBench: Exploring Multi-Step and Constrained Function Calling under Long-Context Scenario
January 17, 2025
Authors: Lucen Zhong, Zhengxiao Du, Xiaohan Zhang, Haiyi Hu, Jie Tang
cs.AI
Abstract
Enhancing large language models (LLMs) with real-time APIs can help generate
more accurate and up-to-date responses. However, evaluating the function
calling abilities of LLMs in real-world scenarios remains under-explored due to
the complexity of data collection and evaluation. In this work, we introduce
ComplexFuncBench, a benchmark for complex function calling across five
real-world scenarios. Compared to existing benchmarks, ComplexFuncBench
encompasses multi-step and constrained function calling, which requires
long-parameter filling, parameter value reasoning, and 128k long context.
Additionally, we propose an automatic framework, ComplexEval, for
quantitatively evaluating complex function calling tasks. Through comprehensive
experiments, we demonstrate the deficiencies of state-of-the-art LLMs in
function calling and suggest future directions for optimizing these
capabilities. The data and code are available at
https://github.com/THUDM/ComplexFuncBench.
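To make the abstract's terms concrete, the following is a minimal, hypothetical sketch of what a multi-step, constrained function-calling task looks like: the second call's key parameter is not present in the user query and must be taken from the first call's output, while a user constraint must be reasoned into a parameter value. The function names, parameters, and data here are invented for illustration and are not taken from the benchmark itself.

```python
# Hypothetical mock APIs (names and fields invented for illustration).
def search_hotels(city, checkin, checkout):
    """Mock travel API: returns candidate hotels for the given dates."""
    return [{"hotel_id": "H123", "name": "Example Inn", "city": city}]

def book_hotel(hotel_id, checkin, checkout, late_checkout=False):
    """Mock booking API: books a hotel and returns a confirmation record."""
    return {"status": "confirmed", "hotel_id": hotel_id,
            "checkin": checkin, "checkout": checkout,
            "late_checkout": late_checkout}

# Step 1: the model fills structured parameters for the first call
# directly from the user query.
results = search_hotels(city="Shanghai",
                        checkin="2025-03-01", checkout="2025-03-03")

# Step 2 (constrained): hotel_id does not appear in the user query;
# it must be extracted from the previous call's result, and the user
# constraint "no late checkout" is reasoned into late_checkout=False.
confirmation = book_hotel(hotel_id=results[0]["hotel_id"],
                          checkin="2025-03-01", checkout="2025-03-03",
                          late_checkout=False)

print(confirmation["status"])  # prints "confirmed"
```

In a real trace of this kind, each step's tool response can be long, which is why evaluating such tasks also stresses long-context handling.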