X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents
April 15, 2025
Authors: Salman Rahman, Liwei Jiang, James Shiffer, Genglin Liu, Sheriff Issaka, Md Rizwan Parvez, Hamid Palangi, Kai-Wei Chang, Yejin Choi, Saadia Gabriel
cs.AI
Abstract
Multi-turn interactions with language models (LMs) pose critical safety
risks, as harmful intent can be strategically spread across exchanges. Yet, the
vast majority of prior work has focused on single-turn safety, while
adaptability and diversity remain among the key challenges of multi-turn
red-teaming. To address these challenges, we present X-Teaming, a scalable
framework that systematically explores how seemingly harmless interactions
escalate into harmful outcomes and generates corresponding attack scenarios.
X-Teaming employs collaborative agents for planning, attack optimization, and
verification, achieving state-of-the-art multi-turn jailbreak effectiveness and
diversity with success rates up to 98.1% across representative leading
open-weight and closed-source models. In particular, X-Teaming achieves a 96.2%
attack success rate against the latest Claude 3.7 Sonnet model, which has been
considered nearly immune to single-turn attacks. Building on X-Teaming, we
introduce XGuard-Train, an open-source multi-turn safety training dataset that
is 20x larger than the previous best resource, comprising 30K interactive
jailbreaks, designed to enable robust multi-turn safety alignment for LMs. Our
work offers essential tools and insights for mitigating sophisticated
conversational attacks, advancing the multi-turn safety of LMs.