Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs

Abstract

In this report, we introduce a collection of methods to enhance reward modeling for LLMs, focusing specifically on data-centric techniques. We propose effective data selection and filtering strategies for curating high-quality open-source preference datasets, culminating in the Skywork-Reward data collection, which contains only 80K preference pairs -- significantly smaller than existing datasets. Using this curated dataset, we developed the Skywork-Reward model series -- Skywork-Reward-Gemma-27B and Skywork-Reward-Llama-3.1-8B -- with the former currently holding the top position on the RewardBench leaderboard. Notably, our techniques and datasets have directly enhanced the performance of many top-ranked models on RewardBench, highlighting the practical impact of our contributions in real-world preference learning applications.

