ComplexFuncBench: Exploring Multi-Step and Constrained Function Calling under Long-Context Scenario
January 17, 2025
Authors: Lucen Zhong, Zhengxiao Du, Xiaohan Zhang, Haiyi Hu, Jie Tang
cs.AI
Abstract
Enhancing large language models (LLMs) with real-time APIs can help generate
more accurate and up-to-date responses. However, evaluating the function
calling abilities of LLMs in real-world scenarios remains under-explored due to
the complexity of data collection and evaluation. In this work, we introduce
ComplexFuncBench, a benchmark for complex function calling across five
real-world scenarios. Compared to existing benchmarks, ComplexFuncBench
encompasses multi-step and constrained function calling, which requires
long-parameter filling, parameter value reasoning, and 128k long context.
Additionally, we propose an automatic framework, ComplexEval, for
quantitatively evaluating complex function calling tasks. Through comprehensive
experiments, we demonstrate the deficiencies of state-of-the-art LLMs in
function calling and suggest future directions for optimizing these
capabilities. The data and code are available at
https://github.com/THUDM/ComplexFuncBench.
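To make the abstract's terms concrete, the following is a minimal, hypothetical sketch of what a multi-step, constrained function-calling task looks like: the second call's key parameter is not present in the user query and must be taken from the first call's output, while a user constraint must be reasoned into a parameter value. The function names, parameters, and data here are invented for illustration and are not taken from the benchmark itself.

```python
# Hypothetical mock APIs (names and fields invented for illustration).
def search_hotels(city, checkin, checkout):
    """Mock travel API: returns candidate hotels for the given dates."""
    return [{"hotel_id": "H123", "name": "Example Inn", "city": city}]

def book_hotel(hotel_id, checkin, checkout, late_checkout=False):
    """Mock booking API: books a hotel and returns a confirmation record."""
    return {"status": "confirmed", "hotel_id": hotel_id,
            "checkin": checkin, "checkout": checkout,
            "late_checkout": late_checkout}

# Step 1: the model fills structured parameters for the first call
# directly from the user query.
results = search_hotels(city="Shanghai",
                        checkin="2025-03-01", checkout="2025-03-03")

# Step 2 (constrained): hotel_id does not appear in the user query;
# it must be extracted from the previous call's result, and the user
# constraint "no late checkout" is reasoned into late_checkout=False.
confirmation = book_hotel(hotel_id=results[0]["hotel_id"],
                          checkin="2025-03-01", checkout="2025-03-03",
                          late_checkout=False)

print(confirmation["status"])  # prints "confirmed"
```

In a real trace of this kind, each step's tool response can be long, which is why evaluating such tasks also stresses long-context handling.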