Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation
January 29, 2025
Authors: Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu
cs.AI
Abstract
Recent research shows that Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks: models lose their safety alignment after fine-tuning on a few harmful samples. To mitigate this risk, a moderation guardrail is typically used to filter out harmful samples before fine-tuning. By designing a new red-teaming method, this paper shows that relying purely on the moderation guardrail for data filtration is not reliable. Our proposed attack method, dubbed Virus, easily bypasses guardrail moderation by slightly modifying the harmful data. Experimental results show that the harmful data optimized by Virus evades guardrail detection with a leakage ratio of up to 100%, while simultaneously achieving superior attack performance. The key message of this paper is that it is reckless to rely on guardrail moderation as a stopgap against harmful fine-tuning attacks, as it cannot solve the inherent safety issue of pre-trained LLMs. Our code is available at https://github.com/git-disl/Virus