Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging

December 27, 2024
Authors: Hua Farn, Hsuan Su, Shachi H Kumar, Saurav Sahay, Shang-Tse Chen, Hung-yi Lee
cs.AI

Abstract

Fine-tuning large language models (LLMs) for downstream tasks is a widely adopted approach, but it often leads to safety degradation in safety-aligned LLMs. Currently, many solutions address this issue by incorporating additional safety data, which can be impractical in many cases. In this paper, we address the question: How can we improve downstream task performance while preserving safety in LLMs without relying on additional safety data? We propose a simple and effective method that maintains the inherent safety of LLMs while enhancing their downstream task performance: merging the weights of pre- and post-fine-tuned safety-aligned models. Experimental results across various downstream tasks, models, and merging methods demonstrate that this approach effectively mitigates safety degradation while improving downstream task performance, offering a practical solution for adapting safety-aligned LLMs.
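The core operation described in the abstract, merging the weights of the pre- and post-fine-tuned models, can be sketched in a few lines. The snippet below is a minimal illustration assuming a PyTorch / Hugging Face Transformers setup and simple linear interpolation; the model identifiers and the interpolation weight `alpha` are placeholders, not values from the paper, which evaluates several merging methods.

```python
# Minimal sketch: linear interpolation between a safety-aligned base model
# (pre-fine-tuning) and its task-fine-tuned counterpart.
# Model names and alpha are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
tuned = AutoModelForCausalLM.from_pretrained("path/to/fine-tuned-model")

alpha = 0.5  # fraction taken from the fine-tuned weights; (1 - alpha) from the base

merged_state = {}
with torch.no_grad():
    base_state = base.state_dict()
    tuned_state = tuned.state_dict()
    for name, base_param in base_state.items():
        merged_state[name] = (1 - alpha) * base_param + alpha * tuned_state[name]

# Load the interpolated weights into one of the models and save the result.
base.load_state_dict(merged_state)
base.save_pretrained("merged-model")
```

Tuning `alpha` trades off between recovering the base model's safety behavior (lower values) and retaining downstream task gains from fine-tuning (higher values).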
