ComplexFuncBench: Exploring Multi-Step and Constrained Function Calling under Long-Context Scenario

Abstract

Enhancing large language models (LLMs) with real-time APIs can help them generate more accurate and up-to-date responses. However, evaluating the function calling abilities of LLMs in real-world scenarios remains under-explored because data collection and evaluation are complex. In this work, we introduce ComplexFuncBench, a benchmark for complex function calling across five real-world scenarios. Compared to existing benchmarks, ComplexFuncBench encompasses multi-step and constrained function calling, which requires long parameter filling, parameter value reasoning, and a 128k long context. Additionally, we propose an automatic framework, ComplexEval, for quantitatively evaluating complex function calling tasks. Through comprehensive experiments, we demonstrate the deficiencies of state-of-the-art LLMs in function calling and suggest future directions for optimizing these capabilities. The data and code are available at https://github.com/THUDM/ComplexFuncBench.
