
Jailbreaking to Jailbreak

February 9, 2025
作者: Jeremy Kritz, Vaughn Robinson, Robert Vacareanu, Bijan Varjavand, Michael Choi, Bobby Gogov, Scale Red Team, Summer Yue, Willow E. Primack, Zifan Wang
cs.AI

Abstract

Refusal training on Large Language Models (LLMs) prevents harmful outputs, yet this defense remains vulnerable to both automated and human-crafted jailbreaks. We present a novel LLM-as-red-teamer approach in which a human jailbreaks a refusal-trained LLM to make it willing to jailbreak itself or other LLMs. We refer to these jailbroken LLMs as J_2 attackers, which can systematically evaluate target models using various red teaming strategies and improve their performance via in-context learning from previous failures. Our experiments demonstrate that Sonnet 3.5 and Gemini 1.5 Pro outperform other LLMs as J_2, achieving 93.0% and 91.0% attack success rates (ASRs) respectively against GPT-4o (with similar results across other capable LLMs) on Harmbench. Our work not only introduces a scalable approach to strategic red teaming, drawing inspiration from human red teamers, but also highlights jailbreaking-to-jailbreak as an overlooked failure mode of safety safeguards. Specifically, an LLM can bypass its own safeguards by employing a jailbroken version of itself that is willing to assist in further jailbreaking. To prevent direct misuse of J_2 while advancing research in AI safety, we publicly share our methodology but keep specific prompting details private.
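
The abstract describes J_2 as a multi-turn attacker that probes a target model with varied red-teaming strategies and learns in context from its failed attempts. The following is a minimal, hypothetical sketch of such a loop under those assumptions, not the authors' implementation; `call_attacker`, `call_target`, `judge_harmful`, and all parameter names are placeholders standing in for the attacker (J_2), target, and judge model APIs.

```python
from typing import Callable

def red_team_behavior(
    behavior: str,
    call_attacker: Callable[[str, list], str],   # J_2 attacker model API (placeholder)
    call_target: Callable[[str], str],           # target model API (placeholder)
    judge_harmful: Callable[[str, str], bool],   # judge that scores attack success (placeholder)
    max_turns: int = 6,
) -> dict:
    """Attack a single harmful behavior over multiple turns, feeding failed
    attempts back into the attacker's context as in-context examples."""
    history: list[dict] = []  # prior (prompt, response, verdict) records
    for turn in range(max_turns):
        # Attacker drafts the next jailbreak prompt, conditioned on past failures.
        attack_prompt = call_attacker(behavior, history)
        # Target model responds to the candidate jailbreak.
        target_response = call_target(attack_prompt)
        # Judge decides whether the response exhibits the requested behavior.
        success = judge_harmful(behavior, target_response)
        history.append({
            "prompt": attack_prompt,
            "response": target_response,
            "success": success,
        })
        if success:
            return {"success": True, "turns": turn + 1, "attempts": history}
    return {"success": False, "turns": max_turns, "attempts": history}
```

An attack-success-rate figure like those reported above would then be the fraction of benchmark behaviors for which such a loop returns success within the turn budget.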
