Modulated Intervention Preference Optimization (MIPO): Keep the Easy, Refine the Difficult

Abstract

Preference optimization methods typically begin training with a well-trained SFT model as the reference model. In RLHF and DPO, a regularization term is applied during preference optimization to keep the policy model from deviating too far from the reference model's distribution, which helps avoid anomalous responses. When the reference model is already well aligned with the given data, or needs only slight adjustment, this approach can produce a well-aligned model. However, if the reference model is not aligned with the given data and must move substantially away from its current state, the regularization term may actually hinder alignment. In this study, we propose Modulated Intervention Preference Optimization (MIPO) to address this issue. MIPO modulates the degree of intervention from the reference model according to how well the given data is aligned with it. If the data is well aligned, the intervention is increased to prevent the policy model from diverging significantly from the reference model. Conversely, if the alignment is poor, the intervention is reduced to allow more extensive training. We compare the performance of MIPO and DPO using Mistral-7B and Llama3-8B on Alpaca Eval 2.0 and MT-Bench. The experimental results show that MIPO consistently outperforms DPO across the evaluation scenarios tested.
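
To make the mechanism described in the abstract concrete, the sketch below contrasts the standard DPO loss for a single preference pair with an illustrative "modulated" variant. The DPO formula is the usual one. The modulated variant is only an assumption made for illustration: the function names `intervention_weight` and `modulated_loss`, and the choice of a softplus weight on the reference model's margin, are hypothetical and may differ from the formulation used in the paper. Inputs are sequence log-probabilities of the chosen (`w`) and rejected (`l`) responses.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pol_w: float, pol_l: float, ref_w: float, ref_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair: the reference margin is always subtracted in full."""
    margin = (pol_w - pol_l) - (ref_w - ref_l)
    return -log_sigmoid(beta * margin)


def intervention_weight(ref_margin: float) -> float:
    """Hypothetical weight (softplus): near 0 when the reference prefers the
    rejected response, and growing as the reference prefers the chosen one."""
    return max(ref_margin, 0.0) + math.log1p(math.exp(-abs(ref_margin)))


def modulated_loss(pol_w: float, pol_l: float, ref_w: float, ref_l: float,
                   beta: float = 0.1) -> float:
    """Illustrative variant (an assumption, not the paper's loss): the reference
    term is scaled by how well the reference already fits this pair."""
    ref_margin = ref_w - ref_l
    margin = (pol_w - pol_l) - intervention_weight(ref_margin) * ref_margin
    return -log_sigmoid(beta * margin)


if __name__ == "__main__":
    # Pair the reference already ranks correctly: strong pull toward the reference.
    print(modulated_loss(-1.0, -2.0, ref_w=-1.0, ref_l=-4.0))
    # Pair the reference ranks badly: the reference term nearly vanishes.
    print(modulated_loss(-1.0, -2.0, ref_w=-60.0, ref_l=-10.0))
```

In this sketch, a pair on which the reference model is badly misaligned contributes almost no reference term, so the policy is free to move, while a pair the reference already handles well is regularized more strongly. That mirrors the qualitative behavior the abstract describes, under the stated assumptions.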
