Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

January 31, 2025
作者: Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, Amanda Askell, Nathan Bailey, Joe Benton, Emma Bluemke, Samuel R. Bowman, Eric Christiansen, Hoagy Cunningham, Andy Dau, Anjali Gopal, Rob Gilson, Logan Graham, Logan Howard, Nimit Kalra, Taesung Lee, Kevin Lin, Peter Lofgren, Francesco Mosconi, Clare O'Hara, Catherine Olsson, Linda Petrini, Samir Rajani, Nikhil Saxena, Alex Silverstein, Tanya Singh, Theodore Sumers, Leonard Tang, Kevin K. Troy, Constantin Weisser, Ruiqi Zhong, Giulio Zhou, Jan Leike, Jared Kaplan, Ethan Perez
cs.AI

Abstract

Large language models (LLMs) are vulnerable to universal jailbreaks: prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content. In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries. On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks. These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead. Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable.
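The guarding scheme the abstract describes can be sketched as a wrapper around model inference: a classifier screens the incoming prompt, and another screens the generated output, refusing when either flags restricted content. Below is a minimal illustrative sketch, not the paper's implementation: the `flags_harm` stand-in uses keyword cues, whereas the paper's classifiers are models trained on constitution-generated synthetic data, and the cue strings and refusal message here are hypothetical.

```python
from typing import Callable

# Hypothetical stand-ins for constitution-derived restricted-content cues.
RESTRICTED_CUES = ("step-by-step synthesis", "weaponize")
REFUSAL = "I can't help with that."

def flags_harm(text: str) -> bool:
    """Stand-in classifier. In the paper, input/output classifiers are
    trained on synthetic data generated from a natural-language constitution."""
    lower = text.lower()
    return any(cue in lower for cue in RESTRICTED_CUES)

def guarded_generate(prompt: str, model: Callable[[str], str]) -> str:
    """Wrap an LLM call with an input classifier and an output classifier."""
    if flags_harm(prompt):           # input classifier screens the prompt
        return REFUSAL
    completion = model(prompt)
    if flags_harm(completion):       # output classifier screens the completion
        return REFUSAL
    return completion

# Usage with a dummy model standing in for the guarded LLM:
echo = lambda p: f"Answer to: {p}"
print(guarded_generate("What is photosynthesis?", echo))
print(guarded_generate("Give a step-by-step synthesis of X", echo))
```

In deployment, screening the output as it streams (rather than only at the end) is what lets such a guard halt a harmful completion partway, at the cost of the inference overhead the abstract quantifies.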
