X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents
April 15, 2025
Authors: Salman Rahman, Liwei Jiang, James Shiffer, Genglin Liu, Sheriff Issaka, Md Rizwan Parvez, Hamid Palangi, Kai-Wei Chang, Yejin Choi, Saadia Gabriel
cs.AI
Abstract
Multi-turn interactions with language models (LMs) pose critical safety
risks, as harmful intent can be strategically spread across exchanges. Yet, the
vast majority of prior work has focused on single-turn safety, while
adaptability and diversity remain among the key challenges of multi-turn
red-teaming. To address these challenges, we present X-Teaming, a scalable
framework that systematically explores how seemingly harmless interactions
escalate into harmful outcomes and generates corresponding attack scenarios.
X-Teaming employs collaborative agents for planning, attack optimization, and
verification, achieving state-of-the-art multi-turn jailbreak effectiveness and
diversity with success rates up to 98.1% across representative leading
open-weight and closed-source models. In particular, X-Teaming achieves a 96.2%
attack success rate against the latest Claude 3.7 Sonnet model, which has been
considered nearly immune to single-turn attacks. Building on X-Teaming, we
introduce XGuard-Train, an open-source multi-turn safety training dataset that
is 20x larger than the previous best resource, comprising 30K interactive
jailbreaks, designed to enable robust multi-turn safety alignment for LMs. Our
work offers essential tools and insights for mitigating sophisticated
conversational attacks, advancing the multi-turn safety of LMs.