

Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback

March 28, 2025
Authors: Wei Shen, Guanlin Liu, Zheng Wu, Ruofei Zhu, Qingping Yang, Chao Xin, Yu Yue, Lin Yan
cs.AI

Abstract

Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning large language models with human preferences. While recent research has focused on algorithmic improvements, the importance of prompt-data construction has been overlooked. This paper addresses this gap by exploring data-driven bottlenecks in RLHF performance scaling, particularly reward hacking and decreasing response diversity. We introduce a hybrid reward system combining reasoning task verifiers (RTV) and a generative reward model (GenRM) to mitigate reward hacking. We also propose a novel prompt-selection method, Pre-PPO, to maintain response diversity and enhance learning effectiveness. Additionally, we find that prioritizing mathematical and coding tasks early in RLHF training significantly improves performance. Experiments across two model sizes validate our methods' effectiveness and scalability. Results show that RTV is most resistant to reward hacking, followed by GenRM with ground truth, and then GenRM with SFT Best-of-N responses. Our strategies enable rapid capture of subtle task-specific distinctions, leading to substantial improvements in overall RLHF performance. This work highlights the importance of careful data construction and provides practical methods to overcome performance barriers in RLHF.
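
The abstract describes routing rewards through a hybrid system: reasoning task verifiers (RTV) for prompts that can be checked programmatically, and a generative reward model (GenRM) otherwise. The sketch below illustrates one plausible way such routing could look; the function names, data fields, and scoring logic are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a hybrid reward system in the spirit of the abstract:
# verifiable reasoning prompts (math/code with ground truth) go to a rule-based
# verifier (RTV); everything else falls back to a generative reward model (GenRM).
# All names and the scoring logic are assumptions for illustration only.

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Prompt:
    text: str
    task_type: str                        # e.g. "math", "code", "open_ended"
    ground_truth: Optional[str] = None    # present only for verifiable tasks


def rtv_reward(prompt: Prompt, response: str) -> float:
    """Reasoning task verifier: programmatic check against ground truth."""
    if prompt.ground_truth is None:
        raise ValueError("RTV requires a ground-truth answer")
    return 1.0 if response.strip() == prompt.ground_truth.strip() else 0.0


def hybrid_reward(prompt: Prompt, response: str,
                  genrm_score: Callable[[str, str], float]) -> float:
    """Prefer RTV when a programmatic check exists; otherwise use GenRM."""
    if prompt.task_type in {"math", "code"} and prompt.ground_truth is not None:
        return rtv_reward(prompt, response)
    # GenRM: a learned scorer over (prompt, response); stubbed out here.
    return genrm_score(prompt.text, response)


if __name__ == "__main__":
    # Stand-in GenRM scorer, purely for demonstration.
    dummy_genrm = lambda prompt_text, response: 0.5
    p = Prompt(text="Compute 2 + 2.", task_type="math", ground_truth="4")
    print(hybrid_reward(p, "4", dummy_genrm))   # 1.0: verified correct by RTV
```

The design intuition, as stated in the abstract, is that rule-based verification is the hardest reward signal to hack, so it should take precedence wherever a ground-truth check is available.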
