

MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents

March 3, 2025
Authors: Kunlun Zhu, Hongyi Du, Zhaochen Hong, Xiaocheng Yang, Shuyi Guo, Zhe Wang, Zhenhailong Wang, Cheng Qian, Xiangru Tang, Heng Ji, Jiaxuan You
cs.AI

Abstract

Large Language Models (LLMs) have shown remarkable capabilities as autonomous agents, yet existing benchmarks either focus on single-agent tasks or are confined to narrow domains, failing to capture the dynamics of multi-agent coordination and competition. In this paper, we introduce MultiAgentBench, a comprehensive benchmark designed to evaluate LLM-based multi-agent systems across diverse, interactive scenarios. Our framework measures not only task completion but also the quality of collaboration and competition using novel, milestone-based key performance indicators. Moreover, we evaluate various coordination protocols (including star, chain, tree, and graph topologies) and innovative strategies such as group discussion and cognitive planning. Notably, gpt-4o-mini achieves the highest average task score, the graph structure performs best among coordination protocols in the research scenario, and cognitive planning improves milestone achievement rates by 3%. Code and datasets are publicly available at https://github.com/MultiagentBench/MARBLE.
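
The abstract names four coordination topologies (star, chain, tree, and graph) and milestone-based key performance indicators. The following is a minimal, illustrative sketch of how such topologies and a milestone metric could be represented; it is not taken from the MARBLE repository, and the function names (`build_topology`, `milestone_rate`) and agent roles shown are hypothetical.

```python
# Hypothetical sketch (not the MARBLE implementation): coordination topologies
# expressed as directed communication edges between agent IDs, plus a toy
# milestone-based score in the spirit of the paper's KPIs.

from itertools import combinations
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (sender, receiver) communication link


def build_topology(agents: List[str], kind: str) -> List[Edge]:
    """Return communication edges for a given coordination topology."""
    if kind == "star":
        # The first agent acts as the hub; all others talk only to it.
        hub, spokes = agents[0], agents[1:]
        return [(hub, a) for a in spokes] + [(a, hub) for a in spokes]
    if kind == "chain":
        # Each agent passes messages to the next agent in sequence.
        return [(a, b) for a, b in zip(agents, agents[1:])]
    if kind == "tree":
        # Simple binary tree: agent i is the parent of agents 2i+1 and 2i+2.
        edges = []
        for i, parent in enumerate(agents):
            for j in (2 * i + 1, 2 * i + 2):
                if j < len(agents):
                    edges.append((parent, agents[j]))
        return edges
    if kind == "graph":
        # Fully connected graph: every pair of agents can exchange messages.
        return [e for a, b in combinations(agents, 2) for e in ((a, b), (b, a))]
    raise ValueError(f"unknown topology: {kind}")


def milestone_rate(achieved: Dict[str, bool]) -> float:
    """Fraction of task milestones marked as achieved (a stand-in KPI)."""
    return sum(achieved.values()) / max(len(achieved), 1)


if __name__ == "__main__":
    agents = ["planner", "coder", "reviewer", "tester"]
    for kind in ("star", "chain", "tree", "graph"):
        print(kind, build_topology(agents, kind))
    print("milestone rate:", milestone_rate({"draft": True, "experiments": True, "writeup": False}))
```

In this sketch, swapping the topology only changes which agents may exchange messages, while the milestone score stays task-level, which mirrors how the benchmark separates coordination protocol from task-completion metrics.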
