

Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model

April 22, 2025
Authors: Junshu Pan, Wei Shen, Shulin Huang, Qiji Zhou, Yue Zhang
cs.AI

Abstract

Direct Preference Optimization (DPO) simplifies reinforcement learning from human feedback (RLHF) for large language models (LLMs) by directly optimizing human preferences without an explicit reward model. We find that during DPO training, the reference model plays the role of a data weight adjuster. However, the common practice of initializing the policy and reference models identically in DPO can lead to inefficient data utilization and impose a performance ceiling. Meanwhile, the lack of a reference model in Simple Preference Optimization (SimPO) reduces training robustness and necessitates stricter conditions to prevent catastrophic forgetting. In this work, we propose Pre-DPO, a simple yet effective DPO-based training paradigm that enhances preference optimization performance by leveraging a guiding reference model. This reference model provides foresight into the optimal policy state achievable through the training preference data, serving as a guiding mechanism that adaptively assigns higher weights to samples more suitable for the model and lower weights to those less suitable. Extensive experiments on AlpacaEval 2.0 and Arena-Hard v0.1 benchmarks demonstrate that Pre-DPO consistently improves the performance of both DPO and SimPO, without relying on external models or additional data.
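For context, the sketch below shows the standard DPO objective (Rafailov et al., 2023) purely to illustrate the "data weight adjuster" role of the reference model described above; it is not the paper's released code. The function name, tensor shapes, and default beta are illustrative assumptions, and the way Pre-DPO actually obtains its guiding reference model is specified in the paper, not here.

```python
# Minimal sketch of the standard DPO loss, written to show where the reference
# model enters the objective. All names and defaults here are assumptions.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Per-example DPO loss from summed sequence log-probabilities.

    In standard DPO the reference model is a frozen copy of the initial policy.
    Pre-DPO, as the abstract describes, instead uses a "guiding" reference model
    that has already been optimized on the preference data, which changes the
    implicit per-sample weighting produced by the terms below.
    """
    # Implicit rewards, measured relative to the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward margin; its gradient scales each example
    # by sigmoid(rejected - chosen), so the reference model effectively reweights
    # the training data.
    losses = -F.logsigmoid(chosen_rewards - rejected_rewards)
    return losses.mean()
```

Under this reading, replacing the frozen initial-policy reference with a guiding reference that has already been trained on the preference data shifts which samples receive large gradient weight, which is the lever the abstract attributes to Pre-DPO.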
