Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation
January 29, 2025
Authors: Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu
cs.AI
Abstract
Recent research shows that Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks: models lose their safety alignment after fine-tuning on a few harmful samples. To mitigate this risk, a moderation guardrail is typically used to filter out harmful samples before fine-tuning. By designing a new red-teaming method, this paper shows that relying purely on the moderation guardrail for data filtration is not reliable. Our proposed attack method, dubbed Virus, easily bypasses guardrail moderation by slightly modifying the harmful data. Experimental results show that the harmful data optimized by Virus evades guardrail detection with a leakage ratio of up to 100%, while simultaneously achieving superior attack performance. The key message of this paper is that it is reckless to rely on guardrail moderation as a stopgap against harmful fine-tuning attacks, as it cannot solve the inherent safety issue of pre-trained LLMs. Our code is available at https://github.com/git-disl/Virus