Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model
April 22, 2025
Authors: Junshu Pan, Wei Shen, Shulin Huang, Qiji Zhou, Yue Zhang
cs.AI
Abstract
Direct Preference Optimization (DPO) simplifies reinforcement learning from
human feedback (RLHF) for large language models (LLMs) by directly optimizing
human preferences without an explicit reward model. We find that during DPO
training, the reference model plays the role of a data weight adjuster.
However, the common practice of initializing the policy and reference models
identically in DPO can lead to inefficient data utilization and impose a
performance ceiling. Meanwhile, the lack of a reference model in Simple
Preference Optimization (SimPO) reduces training robustness and necessitates
stricter conditions to prevent catastrophic forgetting. In this work, we
propose Pre-DPO, a simple yet effective DPO-based training paradigm that
enhances preference optimization performance by leveraging a guiding reference
model. This reference model provides foresight into the optimal policy state
achievable through the training preference data, serving as a guiding mechanism
that adaptively assigns higher weights to samples more suitable for the model
and lower weights to those less suitable. Extensive experiments on AlpacaEval
2.0 and Arena-Hard v0.1 benchmarks demonstrate that Pre-DPO consistently
improves the performance of both DPO and SimPO, without relying on external
models or additional data.
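A minimal sketch of the pairwise DPO objective, illustrating the abstract's point that the reference model acts as a data weight adjuster: the reference log-ratio offsets each pair's margin, so pairs the reference already ranks strongly contribute less to the update. Under Pre-DPO, as described above, the reference log-probabilities would come from a guiding reference model first optimized on the same preference data rather than from a copy of the initial policy. The function and argument names below are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratio margins for the policy and the reference model.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    # The reference margin shifts the policy margin, effectively
    # re-weighting how strongly each preference pair drives the update.
    logits = beta * (pi_logratios - ref_logratios)
    # Standard DPO: maximize the log-sigmoid of the scaled margin.
    return -F.logsigmoid(logits).mean()

# In vanilla DPO, ref_*_logps come from a frozen copy of the initial policy.
# In Pre-DPO (per the abstract), they would instead come from a guiding
# reference model obtained by a prior preference-optimization run.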