DPO-Shift: Shifting the Distribution of Direct Preference Optimization
February 11, 2025
Authors: Xiliang Yang, Feng Jiang, Qianen Zhang, Lei Zhao, Xiao Li
cs.AI
Abstract
Direct Preference Optimization (DPO) and its variants have become increasingly popular for aligning language models with human preferences. These methods aim to teach models to better distinguish between chosen (or preferred) and rejected (or dispreferred) responses. However, prior research has identified that the probability of chosen responses often decreases during training, a phenomenon known as likelihood displacement. To tackle this challenge, in this work we introduce DPO-Shift, which controllably shifts the distribution of the chosen probability. We then show that DPO-Shift exhibits a fundamental trade-off between improving the chosen probability and sacrificing the reward margin, as supported by both theoretical analysis and experimental validation. Furthermore, we demonstrate the superiority of DPO-Shift over DPO on downstream tasks such as MT-Bench and a designed win rate experiment. We believe this study shows that the likelihood displacement issue of DPO can be effectively mitigated with a simple, theoretically grounded solution. Our code is available at https://github.com/Meaquadddd/DPO-Shift.
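To make the idea concrete, here is a minimal PyTorch-style sketch of a DPO objective in which the rejected response's log-ratio term is scaled by a factor f(lambda) <= 1, which is one way to read "controllably shifting the distribution of the chosen probability." The function name `dpo_shift_loss`, the argument layout, and the placeholder value `f_lambda=0.75` are illustrative assumptions rather than the paper's exact formulation; setting `f_lambda=1.0` recovers the standard DPO loss.

```python
import torch
import torch.nn.functional as F

def dpo_shift_loss(policy_chosen_logps, policy_rejected_logps,
                   ref_chosen_logps, ref_rejected_logps,
                   beta=0.1, f_lambda=0.75):
    """Illustrative DPO-style loss with a scaling factor on the rejected term.

    All *_logps arguments are per-example sequence log-probabilities
    (shape: [batch]). f_lambda is a placeholder for the paper's f(lambda);
    f_lambda = 1.0 reduces this to standard DPO.
    """
    # Log-ratios log(pi_theta / pi_ref) for chosen and rejected responses.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps

    # Scale only the rejected term, so the chosen probability is pushed up
    # more aggressively than it is pushed apart from the rejected one.
    logits = beta * (chosen_ratio - f_lambda * rejected_ratio)
    loss = -F.logsigmoid(logits).mean()

    # Unscaled reward margin, reported for monitoring the trade-off.
    reward_margin = beta * (chosen_ratio - rejected_ratio).detach().mean()
    return loss, reward_margin
```

Under this reading, shrinking `f_lambda` favors raising the chosen probability at the cost of a smaller reward margin, mirroring the trade-off described in the abstract.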