

Finding the Sweet Spot: Preference Data Construction for Scaling Preference Optimization

February 24, 2025
作者: Yao Xiao, Hai Ye, Linyao Chen, Hwee Tou Ng, Lidong Bing, Xiaoli Li, Roy Ka-wei Lee
cs.AI

Abstract

Iterative data generation and model retraining are widely used to align large language models (LLMs). The process typically involves a policy model that generates on-policy responses and a reward model that guides training data selection. Direct Preference Optimization (DPO) further enhances this process by constructing preference pairs of chosen and rejected responses. In this work, we aim to scale up the number of on-policy samples via repeated random sampling to improve alignment performance. Conventional practice selects the sample with the highest reward as the chosen response and the sample with the lowest reward as the rejected response for DPO. However, our experiments reveal that this strategy leads to a decline in performance as the sample size increases. To address this, we investigate preference data construction through the lens of the underlying normal distribution of sample rewards. We categorize the reward space into seven representative points and systematically explore all 21 (C_7^2) pairwise combinations. Through evaluations on four models using AlpacaEval 2, we find that selecting the rejected response at the reward position μ − 2σ, rather than at the minimum reward, is crucial for optimal performance. Finally, we introduce a scalable preference data construction strategy that consistently enhances model performance as the sample scale increases.
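As a rough illustration of the selection rule described in the abstract, the Python sketch below builds one DPO preference pair from N repeatedly sampled on-policy responses and their reward-model scores, placing the rejected response near the μ − 2σ position of the empirical reward distribution instead of at the minimum reward. The function name, the use of the empirical mean and standard deviation as μ and σ, and the choice of the highest-reward sample as the chosen response are assumptions made here for illustration; the paper's exact seven-point partition of the reward space and its final scalable strategy are not specified in the abstract.

```python
import numpy as np


def build_preference_pair(responses, rewards, rejected_offset_sigma=-2.0):
    """Construct one (chosen, rejected) pair from N on-policy samples.

    A minimal sketch, not the paper's exact procedure: rewards of repeated
    samples are treated as roughly normal, and the rejected response is the
    sample whose reward is closest to mu + rejected_offset_sigma * sigma
    (mu - 2*sigma by default) rather than the minimum-reward sample.
    """
    rewards = np.asarray(rewards, dtype=float)
    mu, sigma = rewards.mean(), rewards.std()

    # Chosen: highest-reward sample (conventional practice; an assumption here).
    chosen_idx = int(rewards.argmax())

    # Rejected: sample closest to the mu - 2*sigma reward position.
    target = mu + rejected_offset_sigma * sigma
    rejected_idx = int(np.abs(rewards - target).argmin())

    return responses[chosen_idx], responses[rejected_idx]


# Example usage with hypothetical sampled responses and reward scores.
if __name__ == "__main__":
    responses = [f"response_{i}" for i in range(8)]
    rewards = [0.1, 0.4, 0.35, 0.9, 0.2, 0.55, 0.05, 0.7]
    chosen, rejected = build_preference_pair(responses, rewards)
    print("chosen:", chosen, "| rejected:", rejected)
```

The resulting (chosen, rejected) pairs would then be fed into a standard DPO training loop; the sketch only covers the pair-construction step the abstract focuses on.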
