
Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging

December 27, 2024
Authors: Hua Farn, Hsuan Su, Shachi H Kumar, Saurav Sahay, Shang-Tse Chen, Hung-yi Lee
cs.AI

Abstract

Fine-tuning large language models (LLMs) for downstream tasks is a widely adopted approach, but it often leads to safety degradation in safety-aligned LLMs. Currently, many solutions address this issue by incorporating additional safety data, which can be impractical in many cases. In this paper, we address the question: How can we improve downstream task performance while preserving safety in LLMs without relying on additional safety data? We propose a simple and effective method that maintains the inherent safety of LLMs while enhancing their downstream task performance: merging the weights of pre- and post-fine-tuned safety-aligned models. Experimental results across various downstream tasks, models, and merging methods demonstrate that this approach effectively mitigates safety degradation while improving downstream task performance, offering a practical solution for adapting safety-aligned LLMs.
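A minimal sketch of the core idea follows: merge the weights of the safety-aligned model from before fine-tuning with those of the same model after fine-tuning. The paper evaluates several merging methods; this sketch assumes simple linear interpolation, and the function name, the `alpha` coefficient, and the toy tensors are illustrative, not the authors' implementation.

```python
import torch


def merge_state_dicts(pre_sd, post_sd, alpha=0.5):
    """Linearly interpolate parameters of two models with identical architecture.

    pre_sd:  state dict of the safety-aligned model before fine-tuning
    post_sd: state dict of the same model after fine-tuning on the downstream task
    alpha:   interpolation weight; 0.0 keeps only the pre-fine-tuned weights,
             1.0 keeps only the post-fine-tuned weights
    """
    merged = {}
    for name, pre_param in pre_sd.items():
        post_param = post_sd[name]
        merged[name] = (1.0 - alpha) * pre_param + alpha * post_param
    return merged


# Toy usage with small tensors standing in for real LLM parameters.
if __name__ == "__main__":
    pre = {"layer.weight": torch.zeros(2, 2)}
    post = {"layer.weight": torch.ones(2, 2)}
    print(merge_state_dicts(pre, post, alpha=0.5))
```

In practice, both state dicts would come from the same base model loaded before and after fine-tuning (e.g., via `model.state_dict()` in PyTorch), and `alpha` would be chosen to balance downstream task performance against safety retention.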
