Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model
April 22, 2025
Authors: Junshu Pan, Wei Shen, Shulin Huang, Qiji Zhou, Yue Zhang
cs.AI
Abstract
Direct Preference Optimization (DPO) simplifies reinforcement learning from
human feedback (RLHF) for large language models (LLMs) by directly optimizing
human preferences without an explicit reward model. We find that during DPO
training, the reference model plays the role of a data weight adjuster.
However, the common practice of initializing the policy and reference models
identically in DPO can lead to inefficient data utilization and impose a
performance ceiling. Meanwhile, the lack of a reference model in Simple
Preference Optimization (SimPO) reduces training robustness and necessitates
stricter conditions to prevent catastrophic forgetting. In this work, we
propose Pre-DPO, a simple yet effective DPO-based training paradigm that
enhances preference optimization performance by leveraging a guiding reference
model. This reference model provides foresight into the optimal policy state
achievable through the training preference data, serving as a guiding mechanism
that adaptively assigns higher weights to samples more suitable for the model
and lower weights to those less suitable. Extensive experiments on AlpacaEval
2.0 and Arena-Hard v0.1 benchmarks demonstrate that Pre-DPO consistently
improves the performance of both DPO and SimPO, without relying on external
models or additional data.
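A minimal sketch of the pairwise DPO objective, illustrating the abstract's point that the reference model acts as a data weight adjuster: the reference log-ratio offsets each pair's margin, so pairs the reference already ranks strongly contribute less to the update. Under Pre-DPO, as described above, the reference log-probabilities would come from a guiding reference model first optimized on the same preference data rather than from a copy of the initial policy. The function and argument names below are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratio margins for the policy and the reference model.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    # The reference margin shifts the policy margin, effectively
    # re-weighting how strongly each preference pair drives the update.
    logits = beta * (pi_logratios - ref_logratios)
    # Standard DPO: maximize the log-sigmoid of the scaled margin.
    return -F.logsigmoid(logits).mean()

# In vanilla DPO, ref_*_logps come from a frozen copy of the initial policy.
# In Pre-DPO (per the abstract), they would instead come from a guiding
# reference model obtained by a prior preference-optimization run.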