RainbowPlus: Enhancing Adversarial Prompt Generation via Evolutionary Quality-Diversity Search
April 21, 2025
Authors: Quy-Anh Dang, Chris Ngo, Truong-Son Hy
cs.AI
Abstract
Large Language Models (LLMs) exhibit remarkable capabilities but are
susceptible to adversarial prompts that exploit vulnerabilities to produce
unsafe or biased outputs. Existing red-teaming methods often face scalability
challenges, resource-intensive requirements, or limited diversity in attack
strategies. We propose RainbowPlus, a novel red-teaming framework rooted in
evolutionary computation, enhancing adversarial prompt generation through an
adaptive quality-diversity (QD) search that extends classical evolutionary
algorithms like MAP-Elites with innovations tailored for language models. By
employing a multi-element archive to store diverse high-quality prompts and a
comprehensive fitness function to evaluate multiple prompts concurrently,
RainbowPlus overcomes the constraints of single-prompt archives and pairwise
comparisons in prior QD methods like Rainbow Teaming. Experiments comparing
RainbowPlus to QD methods across six benchmark datasets and four open-source
LLMs demonstrate superior attack success rate (ASR) and diversity
(Diverse-Score approx 0.84), generating up to 100 times more unique prompts
(e.g., 10,418 vs. 100 for Ministral-8B-Instruct-2410). Against nine
state-of-the-art methods on the HarmBench dataset with twelve LLMs (ten
open-source, two closed-source), RainbowPlus achieves an average ASR of 81.1%,
surpassing AutoDAN-Turbo by 3.9%, and is 9 times faster (1.45 vs. 13.50 hours).
Our open-source implementation fosters further advancements in LLM safety,
offering a scalable tool for vulnerability assessment. Code and resources are
publicly available at https://github.com/knoveleng/rainbowplus, supporting
reproducibility and future research in LLM red-teaming.