PILAF: Optimal Human Preference Sampling for Reward Modeling
February 6, 2025
Authors: Yunzhen Feng, Ariel Kwiatkowski, Kunhao Zheng, Julia Kempe, Yaqi Duan
cs.AI
Abstract
As large language models increasingly drive real-world applications, aligning
them with human values becomes paramount. Reinforcement Learning from Human
Feedback (RLHF) has emerged as a key technique, translating preference data
into reward models when oracle human values remain inaccessible. In practice,
RLHF mostly relies on approximate reward models, which may not consistently
guide the policy toward maximizing the underlying human values. We propose
Policy-Interpolated Learning for Aligned Feedback (PILAF), a novel response
sampling strategy for preference labeling that explicitly aligns preference
learning with maximizing the underlying oracle reward. PILAF is theoretically
grounded, demonstrating optimality from both an optimization and a statistical
perspective. The method is straightforward to implement and demonstrates strong
performance in iterative and online RLHF settings where feedback curation is
critical.
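The abstract does not spell out PILAF's sampling rule, but the name suggests drawing responses from distributions interpolated between the current policy and a reference policy before sending them for preference labeling. The sketch below is a minimal, hypothetical illustration of that idea, assuming Hugging Face-style causal language models and a simple log-linear interpolation of next-token logits; the function names, the interpolation weight lam, and the pairing scheme are illustrative assumptions, not the paper's specification.

    import torch

    @torch.no_grad()
    def sample_response(policy_model, ref_model, prompt_ids, lam, max_new_tokens=64):
        """Autoregressively sample a response whose next-token distribution is a
        log-linear interpolation of the current policy and a reference policy.
        lam in [0, 1]: 0 recovers the current policy, 1 the reference policy.
        Hypothetical sketch only; the interpolation rule used by PILAF is defined
        in the paper, not reproduced here."""
        ids = prompt_ids.clone()
        for _ in range(max_new_tokens):
            policy_logits = policy_model(ids).logits[:, -1, :]  # current policy logits
            ref_logits = ref_model(ids).logits[:, -1, :]        # reference policy logits
            mixed = (1.0 - lam) * policy_logits + lam * ref_logits
            next_id = torch.multinomial(torch.softmax(mixed, dim=-1), num_samples=1)
            ids = torch.cat([ids, next_id], dim=-1)
        return ids

    # For each prompt, one could draw two candidate responses under different
    # interpolation weights and send the pair to the annotator for preference labeling:
    #   y_a = sample_response(policy, ref, prompt_ids, lam=0.0)  # on-policy sample
    #   y_b = sample_response(policy, ref, prompt_ids, lam=0.5)  # interpolated sample

In this reading, the interpolation weight controls how far labeled responses are pulled away from the current policy toward the reference, which is one way feedback curation could be aligned with optimizing the underlying reward; consult the paper for the actual sampling strategy and its theoretical guarantees.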