Activation Approximations Can Incur Safety Vulnerabilities Even in Aligned LLMs: Comprehensive Analysis and Defense
February 2, 2025
Authors: Jiawen Zhang, Kejia Chen, Lipeng He, Jian Lou, Dan Li, Zunlei Feng, Mingli Song, Jian Liu, Kui Ren, Xiaohu Yang
cs.AI
Abstract
Large Language Models (LLMs) have showcased remarkable capabilities across
various domains. Accompanying the evolving capabilities and expanding
deployment scenarios of LLMs, their deployment challenges escalate due to their
sheer scale and the advanced yet complex activation designs prevalent in
notable model series, such as Llama, Gemma, and Mistral. These challenges have
become particularly pronounced in resource-constrained deployment scenarios,
where mitigating inference efficiency bottlenecks is imperative. Among various
recent efforts, activation approximation has emerged as a promising avenue for
pursuing inference efficiency, sometimes considered indispensable in
applications such as private inference. Although activation approximations
achieve substantial speedups with minimal impact on utility, and even appear
sound and practical for real-world deployment, their safety implications
remain unclear. In this work, we fill this critical gap in LLM safety by
conducting the first systematic safety evaluation of activation approximations.
Our safety vetting spans seven state-of-the-art techniques across three popular
categories, revealing consistent safety degradation across ten safety-aligned LLMs.
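To make the notion of activation approximation concrete, the sketch below is a minimal, hypothetical illustration (not the paper's method or data): it fits a low-degree polynomial to the SiLU activation, the kind of substitution often used to speed up inference or enable private inference, and measures the resulting approximation error. The function names, polynomial degree, and fitting interval are assumptions made purely for illustration.

```python
# Illustrative sketch only (not code from the paper): one common form of
# activation approximation replaces a nonlinear activation such as SiLU with a
# low-degree polynomial, e.g., to make private-inference protocols tractable.
# The degree, fitting interval, and function names below are assumptions chosen
# for illustration.
import numpy as np

def silu(x):
    # Exact SiLU/Swish activation used in models such as Llama: x * sigmoid(x).
    return x / (1.0 + np.exp(-x))

def polynomial_silu(degree=4, lo=-6.0, hi=6.0, num_points=2001):
    # Least-squares polynomial fit of SiLU on [lo, hi]; returns a callable poly.
    xs = np.linspace(lo, hi, num_points)
    coeffs = np.polyfit(xs, silu(xs), degree)
    return np.poly1d(coeffs)

if __name__ == "__main__":
    approx = polynomial_silu()
    xs = np.linspace(-6.0, 6.0, 13)
    # The pointwise error introduced here is the kind of activation perturbation
    # whose safety effect on aligned LLMs the paper systematically evaluates.
    print("max |SiLU(x) - poly(x)| on grid:", np.max(np.abs(silu(xs) - approx(xs))))
```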