MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models

Abstract

Large Language Models (LLMs) have shown substantial improvements in reasoning and decision-making and can hold natural conversations with users. Recently, many tool-use benchmark datasets have been proposed. However, existing datasets have the following limitations: (1) insufficient evaluation scenarios (e.g., they cover only limited tool-use scenes), and (2) high evaluation costs (e.g., GPT API costs). To address these limitations, we propose MTU-Bench, a multi-granularity tool-use benchmark for large language models. For the "multi-granularity" property, MTU-Bench covers five tool-use scenes: single-turn and single-tool, single-turn and multiple-tool, multiple-turn and single-tool, multiple-turn and multiple-tool, and out-of-distribution tasks. In addition, all evaluation metrics in MTU-Bench are computed from the prediction results and the ground truth, without any GPT-based or human evaluation. Moreover, MTU-Bench is collected by transforming existing high-quality datasets to simulate real-world tool-use scenarios, and we also propose an instruction dataset called MTU-Instruct data to enhance the tool-use abilities of existing LLMs. Comprehensive experimental results demonstrate the effectiveness of MTU-Bench. Code and data will be released at https://github.com/MTU-Bench-Team/MTU-Bench.git.
