Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs
October 24, 2024
Authors: Chris Yuhao Liu, Liang Zeng, Jiacai Liu, Rui Yan, Jujie He, Chaojie Wang, Shuicheng Yan, Yang Liu, Yahui Zhou
cs.AI
Abstract
In this report, we introduce a collection of methods to enhance reward modeling for LLMs, focusing specifically on data-centric techniques. We propose effective data selection and filtering strategies for curating high-quality open-source preference datasets, culminating in the Skywork-Reward data collection, which contains only 80K preference pairs -- significantly smaller than existing datasets. Using this curated dataset, we developed the Skywork-Reward model series -- Skywork-Reward-Gemma-27B and Skywork-Reward-Llama-3.1-8B -- with the former currently holding the top position on the RewardBench leaderboard. Notably, our techniques and datasets have directly enhanced the performance of many top-ranked models on RewardBench, highlighting the practical impact of our contributions in real-world preference learning applications.