

HelpSteer2-Preference: Complementing Ratings with Preferences

October 2, 2024
Authors: Zhilin Wang, Alexander Bukharin, Olivier Delalleau, Daniel Egert, Gerald Shen, Jiaqi Zeng, Oleksii Kuchaiev, Yi Dong
cs.AI

Abstract

Reward models are critical for aligning models to follow instructions, and are typically trained following one of two popular paradigms: Bradley-Terry style or Regression style. However, there is a lack of evidence that either approach is better than the other, when adequately matched for data. This is primarily because these approaches require data collected in different (but incompatible) formats, meaning that adequately matched data is not available in existing public datasets. To tackle this problem, we release preference annotations (designed for Bradley-Terry training) to complement existing ratings (designed for Regression style training) in the HelpSteer2 dataset. To improve data interpretability, preference annotations are accompanied by human-written justifications. Using this data, we conduct the first head-to-head comparison of Bradley-Terry and Regression models when adequately matched for data. Based on insights derived from such a comparison, we propose a novel approach to combine Bradley-Terry and Regression reward modeling. A Llama-3.1-70B-Instruct model tuned with this approach scores 94.1 on RewardBench, emerging top of more than 140 reward models as of 1 Oct 2024. We also demonstrate the effectiveness of this reward model at aligning models to follow instructions in RLHF. We open-source this dataset (CC-BY-4.0 license) at https://huggingface.co/datasets/nvidia/HelpSteer2 and openly release the trained Reward Model at https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward.
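For context on the two paradigms contrasted in the abstract: Bradley-Terry style reward models learn from pairwise preference labels, while Regression style models fit scalar ratings directly. The following is a minimal PyTorch sketch of the two objectives; it is illustrative only and is not the paper's code (function names and tensor shapes are assumptions).

```python
# Minimal sketch of the two reward-model training objectives discussed in the abstract.
# Assumptions: the reward model outputs a scalar score per (prompt, response) pair.
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style: maximize the log-probability that the preferred response
    # outscores the rejected one, i.e. minimize -log sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def regression_loss(r_pred: torch.Tensor, rating: torch.Tensor) -> torch.Tensor:
    # Regression style: fit scalar reward predictions to human ratings
    # (e.g. HelpSteer2 helpfulness scores) with mean squared error.
    return F.mse_loss(r_pred, rating)
```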

