若您满意,这便是您的Doge:探索多LLM混合模型中的欺骗与鲁棒性
This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs
March 7, 2025
作者: Lorenz Wolf, Sangwoong Yoon, Ilija Bogunovic
cs.AI
摘要
大语言模型(LLMs)代理混合架构(MoA)通过推理时多个LLM的协作,在AlpacaEval 2.0等知名基准测试中取得了顶尖性能。尽管取得了这些成功,但关于MoA安全性和可靠性的评估尚属空白。我们首次全面研究了MoA在面对故意提供误导性回答的欺骗性LLM代理时的鲁棒性。我们考察了欺骗信息传播、模型规模及信息可用性等因素,揭示了关键漏洞。在AlpacaEval 2.0上,流行的LLaMA 3.1-70B模型结合三层MoA(6个LLM代理)时,长度控制胜率(LC WR)达到49.2%。然而,我们证明,仅需向MoA中引入一个精心指令的欺骗性代理,即可将性能降至37.9%,完全抵消了MoA的所有增益。在QuALITY这一多项选择理解任务中,影响同样严重,准确率惊人地下降了48.5%。部分灵感来源于历史上威尼斯总督选举过程,该过程旨在最小化影响与欺骗,我们提出了一系列无监督防御机制,能够恢复大部分损失的性能。
English
Mixture of large language model (LLMs) Agents (MoA) architectures achieve
state-of-the-art performance on prominent benchmarks like AlpacaEval 2.0 by
leveraging the collaboration of multiple LLMs at inference time. Despite these
successes, an evaluation of the safety and reliability of MoA is missing. We
present the first comprehensive study of MoA's robustness against deceptive LLM
agents that deliberately provide misleading responses. We examine factors like
the propagation of deceptive information, model size, and information
availability, and uncover critical vulnerabilities. On AlpacaEval 2.0, the
popular LLaMA 3.1-70B model achieves a length-controlled Win Rate (LC WR) of
49.2% when coupled with 3-layer MoA (6 LLM agents). However, we demonstrate
that introducing only a single carefully-instructed deceptive agent
into the MoA can reduce performance to 37.9%, effectively nullifying all MoA
gains. On QuALITY, a multiple-choice comprehension task, the impact is also
severe, with accuracy plummeting by a staggering 48.5%. Inspired in part by the
historical Doge of Venice voting process, designed to minimize influence and
deception, we propose a range of unsupervised defense mechanisms that recover
most of the lost performance.Summary
AI-Generated Summary