The Differences Between Direct Alignment Algorithms are a Blur
February 3, 2025
Authors: Alexey Gorbatovski, Boris Shaposhnikov, Viacheslav Sinii, Alexey Malakhov, Daniil Gavrilov
cs.AI
Abstract
Direct Alignment Algorithms (DAAs) simplify language model alignment by
replacing reinforcement learning (RL) and reward modeling (RM) in Reinforcement
Learning from Human Feedback (RLHF) with direct policy optimization. DAAs can
be classified by their ranking losses (pairwise vs. pointwise), by the rewards
used in those losses (e.g., likelihood ratios of policy and reference policy,
or odds ratios), or by whether a Supervised Fine-Tuning (SFT) phase is required
(two-stage vs. one-stage). We first show that one-stage methods underperform
two-stage methods. To address this, we incorporate an explicit SFT phase and
introduce the beta parameter, controlling the strength of preference
optimization, into single-stage ORPO and ASFT. These modifications improve
their performance in Alpaca Eval 2 by +3.46 (ORPO) and +8.27 (ASFT),
matching two-stage methods like DPO. Further analysis reveals that the key
factor is whether the approach uses pairwise or pointwise objectives, rather
than the specific implicit reward or loss function. These results highlight the
importance of careful evaluation to avoid premature claims of performance gains
or overall superiority in alignment algorithms.
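
To make the pairwise-vs-pointwise distinction concrete, here is a minimal, hypothetical PyTorch sketch (not the authors' implementation): a DPO-style pairwise loss contrasts the beta-scaled log-likelihood ratios of the chosen and rejected responses against a reference policy, whereas a pointwise loss scores each response on its own. The function names, the example values, and the exact pointwise form are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch (not the authors' code): the pairwise objective contrasts
# beta-scaled log-likelihood ratios of chosen vs. rejected responses, while the
# pointwise objective pushes each response up or down independently.

def pairwise_loss(policy_logp_chosen, policy_logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style pairwise (contrastive) loss on implicit rewards."""
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

def pointwise_loss(policy_logp, ref_logp, label, beta=0.1):
    """Generic pointwise loss: each response is labeled good (1) or bad (0)
    and optimized without a direct chosen-vs-rejected comparison."""
    reward = beta * (policy_logp - ref_logp)
    return F.binary_cross_entropy_with_logits(reward, label)

# Example usage with per-sequence summed log-probabilities (batch of 2).
logp_c = torch.tensor([-12.0, -15.0])   # policy log p(y_chosen | x)
logp_r = torch.tensor([-14.0, -13.5])   # policy log p(y_rejected | x)
ref_c = torch.tensor([-12.5, -15.2])    # reference log p(y_chosen | x)
ref_r = torch.tensor([-13.8, -13.9])    # reference log p(y_rejected | x)
print(pairwise_loss(logp_c, logp_r, ref_c, ref_r))
print(pointwise_loss(logp_c, ref_c, torch.ones(2)))
```

In both sketches, beta plays the role described in the abstract: it scales how strongly preference optimization pulls the policy away from the reference model.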