

Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions

February 6, 2025
Authors: Yik Siu Chan, Narutatsu Ri, Yuxin Xiao, Marzyeh Ghassemi
cs.AI

Abstract

Despite extensive safety alignment efforts, large language models (LLMs) remain vulnerable to jailbreak attacks that elicit harmful behavior. While existing studies predominantly focus on attack methods that require technical expertise, two critical questions remain underexplored: (1) Are jailbroken responses truly useful in enabling average users to carry out harmful actions? (2) Do safety vulnerabilities exist in more common, simple human-LLM interactions? In this paper, we demonstrate that LLM responses most effectively facilitate harmful actions when they are both actionable and informative--two attributes easily elicited in multi-step, multilingual interactions. Using this insight, we propose HarmScore, a jailbreak metric that measures how effectively an LLM response enables harmful actions, and Speak Easy, a simple multi-step, multilingual attack framework. Notably, by incorporating Speak Easy into direct request and jailbreak baselines, we see an average absolute increase of 0.319 in Attack Success Rate and 0.426 in HarmScore in both open-source and proprietary LLMs across four safety benchmarks. Our work reveals a critical yet often overlooked vulnerability: Malicious users can easily exploit common interaction patterns for harmful intentions.
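The abstract states that responses enable harm most effectively when they are both actionable and informative, and that HarmScore quantifies this. The minimal sketch below illustrates one way such a metric could aggregate per-response judgments of those two attributes, and how the reported absolute increases (e.g., +0.319 ASR, +0.426 HarmScore) are computed; the class, function names, weighting, and averaging scheme are illustrative assumptions, not the paper's actual definition.

```python
# Hypothetical sketch of a HarmScore-style metric. The combination rule
# (product of actionability and informativeness, averaged over responses)
# is an assumption for illustration only; see the paper for the real metric.

from dataclasses import dataclass
from typing import List


@dataclass
class ResponseJudgment:
    actionable: float   # in [0, 1], e.g., from a judge model
    informative: float  # in [0, 1], e.g., from a judge model


def harmscore_sketch(judgments: List[ResponseJudgment]) -> float:
    """Aggregate per-response judgments into a single score in [0, 1].

    Assumption: a response lacking either attribute contributes little,
    so each response contributes the product of its two sub-scores.
    """
    if not judgments:
        return 0.0
    per_response = [j.actionable * j.informative for j in judgments]
    return sum(per_response) / len(per_response)


def absolute_increase(baseline: float, with_method: float) -> float:
    """Absolute (not relative) metric increase, as reported in the abstract."""
    return with_method - baseline


if __name__ == "__main__":
    judgments = [ResponseJudgment(0.9, 0.8), ResponseJudgment(0.4, 0.7)]
    print(f"HarmScore (sketch): {harmscore_sketch(judgments):.3f}")
    print(f"Absolute increase:  {absolute_increase(0.250, 0.569):.3f}")
```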

