Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging

December 27, 2024
Authors: Hua Farn, Hsuan Su, Shachi H Kumar, Saurav Sahay, Shang-Tse Chen, Hung-yi Lee
cs.AI

Abstract

Fine-tuning large language models (LLMs) for downstream tasks is a widely adopted approach, but it often leads to safety degradation in safety-aligned LLMs. Currently, many solutions address this issue by incorporating additional safety data, which can be impractical in many cases. In this paper, we address the question: How can we improve downstream task performance while preserving safety in LLMs without relying on additional safety data? We propose a simple and effective method that maintains the inherent safety of LLMs while enhancing their downstream task performance: merging the weights of pre- and post-fine-tuned safety-aligned models. Experimental results across various downstream tasks, models, and merging methods demonstrate that this approach effectively mitigates safety degradation while improving downstream task performance, offering a practical solution for adapting safety-aligned LLMs.
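The core operation described in the abstract, merging the weights of the pre- and post-fine-tuned models, can be sketched in a few lines. The snippet below is a minimal illustration assuming a PyTorch / Hugging Face Transformers setup and simple linear interpolation; the model identifiers and the interpolation weight `alpha` are placeholders, not values from the paper, which evaluates several merging methods.

```python
# Minimal sketch: linear interpolation between a safety-aligned base model
# (pre-fine-tuning) and its task-fine-tuned counterpart.
# Model names and alpha are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
tuned = AutoModelForCausalLM.from_pretrained("path/to/fine-tuned-model")

alpha = 0.5  # fraction taken from the fine-tuned weights; (1 - alpha) from the base

merged_state = {}
with torch.no_grad():
    base_state = base.state_dict()
    tuned_state = tuned.state_dict()
    for name, base_param in base_state.items():
        merged_state[name] = (1 - alpha) * base_param + alpha * tuned_state[name]

# Load the interpolated weights into one of the models and save the result.
base.load_state_dict(merged_state)
base.save_pretrained("merged-model")
```

Tuning `alpha` trades off between recovering the base model's safety behavior (lower values) and retaining downstream task gains from fine-tuning (higher values).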
