
Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging

December 27, 2024
Authors: Hua Farn, Hsuan Su, Shachi H Kumar, Saurav Sahay, Shang-Tse Chen, Hung-yi Lee
cs.AI

Abstract

Fine-tuning large language models (LLMs) for downstream tasks is a widely adopted approach, but it often leads to safety degradation in safety-aligned LLMs. Currently, many solutions address this issue by incorporating additional safety data, which can be impractical in many cases. In this paper, we address the question: How can we improve downstream task performance while preserving safety in LLMs without relying on additional safety data? We propose a simple and effective method that maintains the inherent safety of LLMs while enhancing their downstream task performance: merging the weights of pre- and post-fine-tuned safety-aligned models. Experimental results across various downstream tasks, models, and merging methods demonstrate that this approach effectively mitigates safety degradation while improving downstream task performance, offering a practical solution for adapting safety-aligned LLMs.
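A minimal sketch of the core idea follows: merge the weights of the safety-aligned model from before fine-tuning with those of the same model after fine-tuning. The paper evaluates several merging methods; this sketch assumes simple linear interpolation, and the function name, the `alpha` coefficient, and the toy tensors are illustrative, not the authors' implementation.

```python
import torch


def merge_state_dicts(pre_sd, post_sd, alpha=0.5):
    """Linearly interpolate parameters of two models with identical architecture.

    pre_sd:  state dict of the safety-aligned model before fine-tuning
    post_sd: state dict of the same model after fine-tuning on the downstream task
    alpha:   interpolation weight; 0.0 keeps only the pre-fine-tuned weights,
             1.0 keeps only the post-fine-tuned weights
    """
    merged = {}
    for name, pre_param in pre_sd.items():
        post_param = post_sd[name]
        merged[name] = (1.0 - alpha) * pre_param + alpha * post_param
    return merged


# Toy usage with small tensors standing in for real LLM parameters.
if __name__ == "__main__":
    pre = {"layer.weight": torch.zeros(2, 2)}
    post = {"layer.weight": torch.ones(2, 2)}
    print(merge_state_dicts(pre, post, alpha=0.5))
```

In practice, both state dicts would come from the same base model loaded before and after fine-tuning (e.g., via `model.state_dict()` in PyTorch), and `alpha` would be chosen to balance downstream task performance against safety retention.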
