X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents

Abstract

Multi-turn interactions with language models (LMs) pose critical safety risks, as harmful intent can be strategically spread across exchanges. Yet, the vast majority of prior work has focused on single-turn safety, while adaptability and diversity remain among the key challenges of multi-turn red-teaming. To address these challenges, we present X-Teaming, a scalable framework that systematically explores how seemingly harmless interactions escalate into harmful outcomes and generates corresponding attack scenarios. X-Teaming employs collaborative agents for planning, attack optimization, and verification, achieving state-of-the-art multi-turn jailbreak effectiveness and diversity with success rates up to 98.1% across representative leading open-weight and closed-source models. In particular, X-Teaming achieves a 96.2% attack success rate against the latest Claude 3.7 Sonnet model, which has been considered nearly immune to single-turn attacks. Building on X-Teaming, we introduce XGuard-Train, an open-source multi-turn safety training dataset that is 20x larger than the previous best resource, comprising 30K interactive jailbreaks, designed to enable robust multi-turn safety alignment for LMs. Our work offers essential tools and insights for mitigating sophisticated conversational attacks, advancing the multi-turn safety of LMs.
