HelpSteer2-Preference: Complementing Ratings with Preferences

Abstract

Reward models are critical for aligning models to follow instructions, and are typically trained following one of two popular paradigms: Bradley-Terry style or Regression style. However, there is a lack of evidence that either approach is better than the other when adequately matched for data. This is primarily because these approaches require data collected in different (and mutually incompatible) formats, meaning that adequately matched data is not available in existing public datasets. To tackle this problem, we release preference annotations (designed for Bradley-Terry training) to complement existing ratings (designed for Regression style training) in the HelpSteer2 dataset. To improve data interpretability, preference annotations are accompanied by human-written justifications. Using this data, we conduct the first head-to-head comparison of Bradley-Terry and Regression models when adequately matched for data. Based on insights derived from this comparison, we propose a novel approach to combine Bradley-Terry and Regression reward modeling. A Llama-3.1-70B-Instruct model tuned with this approach scores 94.1 on RewardBench, ranking first among more than 140 reward models as of 1 Oct 2024. We also demonstrate the effectiveness of this reward model at aligning models to follow instructions in RLHF. We open-source this dataset (CC-BY-4.0 license) at https://huggingface.co/datasets/nvidia/HelpSteer2 and openly release the trained Reward Model at https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward.
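
To make the contrast between the two paradigms concrete, the following is a minimal sketch of the textbook form of each training objective and of the data format each one needs. It is an illustration under stated assumptions, not the paper's implementation: the function names, the example records, and the numeric values are hypothetical, and the combined approach proposed in the paper is not shown.

```python
import math


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Textbook Bradley-Terry objective: -log(sigmoid(r_chosen - r_rejected)).

    Needs a *pair* of responses to the same prompt plus a preference label.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def regression_loss(predicted_rating: float, human_rating: float) -> float:
    """Textbook regression objective: squared error against a human rating.

    Needs a *single* response plus an absolute score.
    """
    return (predicted_rating - human_rating) ** 2


# Hypothetical records illustrating why the two data formats are not interchangeable.
preference_example = {"prompt": "...", "chosen": "response A", "rejected": "response B"}
rating_example = {"prompt": "...", "response": "response A", "helpfulness": 4.0}

if __name__ == "__main__":
    # Hypothetical reward-model outputs, chosen only for illustration.
    print(bradley_terry_loss(reward_chosen=2.0, reward_rejected=0.5))
    print(regression_loss(predicted_rating=3.5, human_rating=rating_example["helpfulness"]))
```

A preference record says only which of two responses is better, while a rating record assigns an absolute score to one response, which is why a dataset collected for one objective cannot be used directly for the other.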
