
Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region

February 19, 2025
Authors: Chak Tou Leong, Qingyu Yin, Jian Wang, Wenjie Li
cs.AI

Abstract

The safety alignment of large language models (LLMs) remains vulnerable, as their initial behavior can be easily jailbroken by even relatively simple attacks. Since infilling a fixed template between the input instruction and initial model output is a common practice for existing LLMs, we hypothesize that this template is a key factor behind their vulnerabilities: LLMs' safety-related decision-making overly relies on the aggregated information from the template region, which largely influences these models' safety behavior. We refer to this issue as template-anchored safety alignment. In this paper, we conduct extensive experiments and verify that template-anchored safety alignment is widespread across various aligned LLMs. Our mechanistic analyses demonstrate how it leads to models' susceptibility when encountering inference-time jailbreak attacks. Furthermore, we show that detaching safety mechanisms from the template region is promising in mitigating vulnerabilities to jailbreak attacks. We encourage future research to develop more robust safety alignment techniques that reduce reliance on the template region.
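
To make the notion of a "template region" concrete, below is a minimal sketch of how a chat prompt is typically assembled. The ChatML-style markers (`<|im_start|>`, `<|im_end|>`) are used purely for illustration and are an assumption here; each aligned LLM defines its own fixed template. The fixed tokens inserted between the user's instruction and the first token of the model's reply constitute the region the paper analyzes.

```python
# Illustrative sketch (not the paper's code): the "template region" is the
# run of fixed tokens that a chat model infills between the user's
# instruction and the start of the assistant's output. ChatML-style markers
# are assumed here for concreteness; real models vary.

def build_prompt(instruction: str) -> str:
    """Wrap a raw instruction in a ChatML-style chat template."""
    return (
        "<|im_start|>user\n"
        f"{instruction}<|im_end|>\n"
        # ---- template region: fixed tokens preceding the model's output.
        # The paper's hypothesis is that safety-related decisions rely
        # disproportionately on information aggregated at these positions.
        "<|im_start|>assistant\n"
    )

if __name__ == "__main__":
    print(build_prompt("How do I make a paper boat?"))
```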
