RRM: Robust Reward Model Training Mitigates Reward Hacking
September 20, 2024
Authors: Tianqi Liu, Wei Xiong, Jie Ren, Lichang Chen, Junru Wu, Rishabh Joshi, Yang Gao, Jiaming Shen, Zhen Qin, Tianhe Yu, Daniel Sohn, Anastasiia Makarova, Jeremiah Liu, Yuan Liu, Bilal Piot, Abe Ittycheriah, Aviral Kumar, Mohammad Saleh
cs.AI
Abstract
Reward models (RMs) play a pivotal role in aligning large language models
(LLMs) with human preferences. However, traditional RM training, which relies
on response pairs tied to specific prompts, struggles to disentangle
prompt-driven preferences from prompt-independent artifacts, such as response
length and format. In this work, we expose a fundamental limitation of current
RM training methods, where RMs fail to effectively distinguish between
contextual signals and irrelevant artifacts when determining preferences. To
address this, we introduce a causal framework that learns preferences
independent of these artifacts and propose a novel data augmentation technique
designed to eliminate them. Extensive experiments show that our approach
successfully filters out undesirable artifacts, yielding a more robust reward
model (RRM). Our RRM improves the RewardBench accuracy of a pairwise reward
model trained on Gemma-2-9b-it from 80.61% to 84.15%. Additionally, we train
two DPO policies using both the RM and the RRM,
demonstrating that the RRM significantly enhances DPO-aligned policies,
improving MT-Bench scores from 7.27 to 8.31 and length-controlled win-rates in
AlpacaEval-2 from 33.46% to 52.49%.
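To make the data-augmentation idea concrete, below is a minimal Python sketch of one way artifact-neutralizing comparisons could be constructed: each prompt's own responses are paired against responses borrowed from unrelated prompts, so the preference label is fully determined by contextual relevance rather than by length or formatting. The PreferencePair schema, the augment_with_cross_prompt_pairs function, and the exact pairing and labeling scheme are illustrative assumptions, not the authors' released implementation.

```python
import random
from dataclasses import dataclass


# Hypothetical schema for pairwise preference data; the field names are
# illustrative, not the paper's actual data format.
@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response preferred for this prompt
    rejected: str  # response dis-preferred for this prompt


def augment_with_cross_prompt_pairs(pairs, seed=0):
    """Add artifact-neutralizing comparisons to a pairwise preference set.

    Sketch of the idea: a response written for an unrelated prompt should not
    be preferred over a response actually written for the current prompt, no
    matter how long or nicely formatted it is. Pairing each prompt's own
    responses against off-prompt responses therefore gives the reward model
    training signal that prompt-independent artifacts cannot explain.
    """
    if len(pairs) < 2:
        return list(pairs)
    rng = random.Random(seed)
    augmented = list(pairs)
    for i, p in enumerate(pairs):
        # Borrow a response that was written for a different prompt.
        j = rng.choice([k for k in range(len(pairs)) if k != i])
        off_prompt = rng.choice([pairs[j].chosen, pairs[j].rejected])
        # Both of this prompt's responses are contextually relevant, so each
        # is labeled as preferred over the off-prompt response.
        augmented.append(PreferencePair(p.prompt, p.chosen, off_prompt))
        augmented.append(PreferencePair(p.prompt, p.rejected, off_prompt))
    return augmented


if __name__ == "__main__":
    data = [
        PreferencePair("Explain the TCP handshake.",
                       "It is a three-step SYN / SYN-ACK / ACK exchange ...",
                       "TCP is a protocol."),
        PreferencePair("Write a haiku about rain.",
                       "Soft rain on the roof / ...",
                       "Rain falls. The end."),
    ]
    for pair in augment_with_cross_prompt_pairs(data):
        print(f"{pair.prompt[:28]:30} chosen={pair.chosen[:22]!r:28} "
              f"rejected={pair.rejected[:22]!r}")
```

Because preference labels in such cross-prompt pairs cannot be predicted from surface artifacts alone, a reward model trained on the augmented set is pushed toward the contextual signal, which is the robustness property the abstract attributes to RRM.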