Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging

Abstract

Fine-tuning large language models (LLMs) for downstream tasks is a widely adopted approach, but it often leads to safety degradation in safety-aligned LLMs. Currently, many solutions address this issue by incorporating additional safety data, which can be impractical in many cases. In this paper, we address the question: How can we improve downstream task performance while preserving safety in LLMs without relying on additional safety data? We propose a simple and effective method that maintains the inherent safety of LLMs while enhancing their downstream task performance: merging the weights of pre- and post-fine-tuned safety-aligned models. Experimental results across various downstream tasks, models, and merging methods demonstrate that this approach effectively mitigates safety degradation while improving downstream task performance, offering a practical solution for adapting safety-aligned LLMs.
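As a rough illustration of merging the weights of a model before and after fine-tuning, the sketch below uses plain linear interpolation. This is only one possible merging method and is an assumption here, since the abstract does not say which methods were evaluated. The function name `merge_linear`, the coefficient `lam`, and the toy weights are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Parameter name -> flattened weights. Real checkpoints hold tensors;
# plain lists keep this sketch dependency-free.
StateDict = Dict[str, List[float]]


def merge_linear(base: StateDict, tuned: StateDict, lam: float) -> StateDict:
    """Interpolate each parameter as (1 - lam) * base + lam * tuned.

    lam = 0 returns the safety-aligned base weights unchanged,
    lam = 1 returns the fine-tuned weights unchanged.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    if base.keys() != tuned.keys():
        raise ValueError("models must share the same parameter names")

    merged: StateDict = {}
    for name, base_w in base.items():
        tuned_w = tuned[name]
        if len(base_w) != len(tuned_w):
            raise ValueError(f"size mismatch for parameter {name!r}")
        merged[name] = [(1.0 - lam) * b + lam * t for b, t in zip(base_w, tuned_w)]
    return merged


# Toy weights standing in for real checkpoints (hypothetical values).
base_model = {"layer.weight": [0.0, 1.0], "layer.bias": [2.0]}
tuned_model = {"layer.weight": [1.0, 3.0], "layer.bias": [4.0]}

merged_model = merge_linear(base_model, tuned_model, lam=0.5)
print(merged_model)
```

In this sketch, `lam` trades off the two endpoints: smaller values stay closer to the safety-aligned base model, larger values stay closer to the fine-tuned model. How to choose it in practice is not specified in the abstract.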
