RainbowPlus: Enhancing Adversarial Prompt Generation via Evolutionary Quality-Diversity Search

Abstract

Large Language Models (LLMs) exhibit remarkable capabilities but remain susceptible to adversarial prompts that exploit vulnerabilities to elicit unsafe or biased outputs. Existing red-teaming methods often face scalability challenges, resource-intensive requirements, or limited diversity in attack strategies. We propose RainbowPlus, a novel red-teaming framework rooted in evolutionary computation. It enhances adversarial prompt generation through an adaptive quality-diversity (QD) search that extends classical evolutionary algorithms such as MAP-Elites with innovations tailored to language models. By employing a multi-element archive to store diverse high-quality prompts and a comprehensive fitness function that evaluates multiple prompts concurrently, RainbowPlus overcomes the constraints of single-prompt archives and pairwise comparisons in prior QD methods such as Rainbow Teaming. In experiments comparing RainbowPlus with QD methods across six benchmark datasets and four open-source LLMs, it achieves a superior attack success rate (ASR) and diversity (Diverse-Score ≈ 0.84) and generates up to 100 times more unique prompts (e.g., 10,418 vs. 100 for Ministral-8B-Instruct-2410). Against nine state-of-the-art methods on the HarmBench dataset with twelve LLMs (ten open-source, two closed-source), RainbowPlus achieves an average ASR of 81.1%, surpassing AutoDAN-Turbo by 3.9%, and is 9 times faster (1.45 vs. 13.50 hours). Our open-source implementation fosters further advances in LLM safety by offering a scalable tool for vulnerability assessment. Code and resources are publicly available at https://github.com/knoveleng/rainbowplus, supporting reproducibility and future research in LLM red-teaming.
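The contrast between a single-prompt archive and a multi-element archive can be illustrated with a toy sketch. This is not the RainbowPlus implementation: the class names, the descriptor pair used as a cell key, the fixed fitness threshold, and the placeholder strings and scores are all assumptions made for illustration. No language model is called; fitness values are hard-coded.

```python
from typing import Dict, List, Tuple

# Hypothetical cell key: a pair of descriptor labels (names are illustrative).
Cell = Tuple[str, str]
Scored = Tuple[str, float]  # (candidate text, fitness score)


class SingleEliteArchive:
    """MAP-Elites style archive that keeps only the best item per cell."""

    def __init__(self) -> None:
        self.cells: Dict[Cell, Scored] = {}

    def add(self, cell: Cell, item: str, fitness: float) -> bool:
        current = self.cells.get(cell)
        # Replace the stored elite only if the new item is strictly better.
        if current is None or fitness > current[1]:
            self.cells[cell] = (item, fitness)
            return True
        return False

    def size(self) -> int:
        return len(self.cells)


class MultiElementArchive:
    """Archive that keeps every distinct item whose fitness meets a threshold.

    The threshold rule is an assumption of this sketch, not a detail
    taken from the abstract.
    """

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.cells: Dict[Cell, List[Scored]] = {}

    def add_batch(self, cell: Cell, scored: List[Scored]) -> int:
        # Score a whole batch at once instead of comparing pairs.
        seen = {item for item, _ in self.cells.get(cell, [])}
        added = 0
        for item, fitness in scored:
            if fitness >= self.threshold and item not in seen:
                self.cells.setdefault(cell, []).append((item, fitness))
                seen.add(item)
                added += 1
        return added

    def size(self) -> int:
        return sum(len(items) for items in self.cells.values())


# Placeholder candidates with made-up fitness scores.
candidates = [
    (("category_a", "style_x"), "candidate 1", 0.9),
    (("category_a", "style_x"), "candidate 2", 0.8),
    (("category_a", "style_x"), "candidate 3", 0.3),
    (("category_b", "style_y"), "candidate 4", 0.7),
]

single = SingleEliteArchive()
multi = MultiElementArchive(threshold=0.6)

batches: Dict[Cell, List[Scored]] = {}
for cell, item, score in candidates:
    single.add(cell, item, score)
    batches.setdefault(cell, []).append((item, score))

for cell, scored in batches.items():
    multi.add_batch(cell, scored)

print(single.size(), multi.size())
```

In this toy run the single-elite archive retains one item per cell, while the multi-element archive retains every candidate above the threshold, which is the mechanism by which a multi-element archive can accumulate many more unique entries over the same search.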
