Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region

Abstract

The safety alignment of large language models (LLMs) remains vulnerable, as their initial behavior can be easily jailbroken by even relatively simple attacks. Since infilling a fixed template between the input instruction and initial model output is a common practice for existing LLMs, we hypothesize that this template is a key factor behind their vulnerabilities: LLMs' safety-related decision-making overly relies on the aggregated information from the template region, which largely influences these models' safety behavior. We refer to this issue as template-anchored safety alignment. In this paper, we conduct extensive experiments and verify that template-anchored safety alignment is widespread across various aligned LLMs. Our mechanistic analyses demonstrate how it leads to models' susceptibility when encountering inference-time jailbreak attacks. Furthermore, we show that detaching safety mechanisms from the template region is promising in mitigating vulnerabilities to jailbreak attacks. We encourage future research to develop more robust safety alignment techniques that reduce reliance on the template region.
