
General Preference Modeling with Preference Representations for Aligning Language Models

October 3, 2024
Authors: Yifan Zhang, Ge Zhang, Yue Wu, Kangping Xu, Quanquan Gu
cs.AI

Abstract

Modeling human preferences is crucial for aligning foundation models with human values. Traditional reward modeling methods, such as the Bradley-Terry (BT) reward model, fall short in expressiveness, particularly in addressing intransitive preferences. Although supervised pair preference models (PairPM) can express general preferences, their implementation is highly ad-hoc and cannot guarantee a consistent preference probability of compared pairs. Additionally, they impose high computational costs due to their quadratic query complexity when comparing multiple responses. In this paper, we introduce preference representation learning, an approach that embeds responses into a latent space to capture intricate preference structures efficiently, achieving linear query complexity. Additionally, we propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback. Experimental results show that our General Preference representation model (GPM) outperforms the BT reward model on the RewardBench benchmark with a margin of up to 5.6% and effectively models cyclic preferences where any BT reward model behaves like a random guess. Furthermore, evaluations on downstream tasks such as AlpacaEval2.0 and MT-Bench, following the language model post-training with GPO and our general preference model, reveal substantial performance improvements with margins up to 9.3%. These findings indicate that our method may enhance the alignment of foundation models with nuanced human values. The code is available at https://github.com/general-preference/general-preference-model.
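To make the idea of preference representations concrete, below is a minimal, self-contained sketch (not the authors' implementation; the 2-D embeddings, the operator R, and all names are illustrative assumptions). It shows why embedding responses in a latent space and scoring pairs with a skew-symmetric operator can encode a cyclic preference A > B > C > A, which no scalar Bradley-Terry reward can represent.

```python
import numpy as np

# Illustrative sketch only: score pairs of response embeddings with a
# skew-symmetric operator, so the score is antisymmetric by construction:
# score(v1, v2) == -score(v2, v1).
R = np.array([[0.0, 1.0],
              [-1.0, 0.0]])  # R.T == -R (skew-symmetric)

def preference_score(v1, v2):
    """Preference score s(y1 > y2) from latent embeddings v1, v2."""
    return v1 @ R @ v2

def preference_prob(v1, v2):
    """Map the score to a preference probability with a sigmoid."""
    return 1.0 / (1.0 + np.exp(-preference_score(v1, v2)))

# Three toy responses embedded 120 degrees apart on the unit circle.
angles = [0.0, 2 * np.pi / 3, 4 * np.pi / 3]
v = [np.array([np.cos(a), np.sin(a)]) for a in angles]

print(preference_prob(v[0], v[1]))  # A preferred to B (> 0.5)
print(preference_prob(v[1], v[2]))  # B preferred to C (> 0.5)
print(preference_prob(v[2], v[0]))  # C preferred to A (> 0.5), a cycle
```

All three probabilities exceed 0.5 simultaneously, so the cycle is fit exactly; a single scalar reward per response would force at least one of the three comparisons to be a coin flip or worse, which is the abstract's point about BT reward models on cyclic preferences. The toy vectors stand in for learned latent embeddings; the paper's actual architecture and training objective are described in the full text and the linked repository.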
