

RePO: ReLU-based Preference Optimization

March 10, 2025
作者: Junkang Wu, Kexin Huang, Xue Wang, Jinyang Gao, Bolin Ding, Jiancan Wu, Xiangnan He, Xiang Wang
cs.AI

Abstract

Aligning large language models (LLMs) with human preferences is critical for real-world deployment, yet existing methods like RLHF face computational and stability challenges. While DPO establishes an offline paradigm with a single hyperparameter beta, subsequent methods like SimPO reintroduce complexity through dual parameters (beta, gamma). We propose ReLU-based Preference Optimization (RePO), a streamlined algorithm that eliminates beta via two advances: (1) retaining SimPO's reference-free margins but removing beta through gradient analysis, and (2) adopting a ReLU-based max-margin loss that naturally filters out trivial pairs. Theoretically, RePO is characterized as SimPO's limiting case (beta → ∞), in which the logistic weighting collapses to binary thresholding, forming a convex envelope of the 0-1 loss. Empirical results on AlpacaEval 2 and Arena-Hard show that RePO outperforms DPO and SimPO across multiple base models while requiring only one hyperparameter to tune.
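
To make the objective concrete, the following is a minimal PyTorch sketch of a ReLU-based max-margin preference loss as described in the abstract; the length-normalized margin, the function name repo_loss, and the default value of gamma are illustrative assumptions rather than the authors' released implementation.

import torch
import torch.nn.functional as F

def repo_loss(logp_chosen, logp_rejected, gamma=1.0):
    # logp_chosen / logp_rejected: (length-normalized) sequence log-probabilities
    # of the preferred / dispreferred responses under the policy, i.e. the
    # reference-free margin inherited from SimPO (an assumption here).
    margin = logp_chosen - logp_rejected
    # Hinge (ReLU) loss with a single hyperparameter gamma: pairs whose margin
    # already exceeds gamma contribute zero loss and zero gradient, so trivial
    # pairs are filtered out naturally.
    return F.relu(gamma - margin).mean()

For comparison, SimPO applies a logistic (log-sigmoid) loss to the same reference-free margin using both beta and gamma; the abstract's limiting-case argument is that as beta → ∞ this smooth weighting collapses to the binary thresholding implemented by the ReLU above.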
