Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?
February 2, 2025
Authors: Wenzhe Li, Yong Lin, Mengzhou Xia, Chi Jin
cs.AI
Abstract
Ensembling outputs from diverse sources is a straightforward yet effective
approach to boost performance. Mixture-of-Agents (MoA) is one such popular
ensemble method that aggregates outputs from multiple different Large Language
Models (LLMs). This paper raises the question in the context of language
models: is mixing different LLMs truly beneficial? We propose Self-MoA -- an
ensemble method that aggregates outputs from only the single top-performing
LLM. Our extensive experiments reveal that, surprisingly, Self-MoA outperforms
standard MoA that mixes different LLMs in a large number of scenarios: Self-MoA
achieves 6.6% improvement over MoA on the AlpacaEval 2.0 benchmark, and an
average of 3.8% improvement across various benchmarks, including MMLU, CRUX,
and MATH. Applying Self-MoA to one of the top-ranking models in AlpacaEval 2.0
directly achieves the new state-of-the-art performance on the leaderboard. To
understand the effectiveness of Self-MoA, we systematically investigate the
trade-off between diversity and quality of outputs under various MoA settings.
We confirm that the MoA performance is rather sensitive to the quality, and
mixing different LLMs often lowers the average quality of the models. To
complement the study, we identify the scenarios where mixing different LLMs
could be helpful. This paper further introduces a sequential version of
Self-MoA, that is capable of aggregating a large number of LLM outputs
on-the-fly over multiple rounds, and is as effective as aggregating all outputs
at once.
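As a rough illustration of the method described in the abstract, the sketch below shows how Self-MoA and its sequential variant could be wired up: sample several outputs from a single top-performing model, then have the same model aggregate them. The `chat` helper, the aggregation prompt, and all parameter values (number of samples, window size, temperatures) are assumptions made for illustration; this is not the authors' implementation.

```python
# Minimal sketch of Self-MoA and Sequential Self-MoA, as described in the abstract.
# `chat(model, prompt, temperature)` is a hypothetical stand-in for any LLM API call.

from typing import List, Optional


def chat(model: str, prompt: str, temperature: float = 1.0) -> str:
    """Hypothetical wrapper around an LLM API; replace with a real client."""
    raise NotImplementedError


AGGREGATION_PROMPT = (
    "You are given several candidate responses to the same user query. "
    "Synthesize them into a single, higher-quality response.\n\n"
    "Query:\n{query}\n\nCandidate responses:\n{candidates}"
)


def aggregate(model: str, query: str, candidates: List[str]) -> str:
    """Ask the model to merge a list of candidate responses into one answer."""
    numbered = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(candidates))
    prompt = AGGREGATION_PROMPT.format(query=query, candidates=numbered)
    return chat(model, prompt, temperature=0.0)


def self_moa(model: str, query: str, num_samples: int = 6) -> str:
    """Self-MoA: sample several outputs from ONE top model, then aggregate them.

    In-model diversity comes from repeated sampling at non-zero temperature,
    rather than from mixing outputs of different LLMs.
    """
    candidates = [chat(model, query, temperature=1.0) for _ in range(num_samples)]
    return aggregate(model, query, candidates)


def self_moa_seq(model: str, query: str,
                 num_samples: int = 12, window: int = 4) -> str:
    """Sequential Self-MoA: aggregate many outputs over multiple rounds,
    carrying the running synthesis forward instead of aggregating all at once."""
    running: Optional[str] = None
    for start in range(0, num_samples, window):
        chunk = [chat(model, query, temperature=1.0)
                 for _ in range(min(window, num_samples - start))]
        if running is not None:
            chunk = [running] + chunk  # keep the previous synthesis in the mix
        running = aggregate(model, query, chunk)
    return running or ""
```

In this sketch the same model plays both the proposer and aggregator roles, which is the defining difference from standard MoA, where the candidate responses would come from several different LLMs.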