General Preference Modeling with Preference Representations for Aligning Language Models

Abstract

Modeling human preferences is crucial for aligning foundation models with human values. Traditional reward modeling methods, such as the Bradley-Terry (BT) reward model, fall short in expressiveness, particularly in addressing intransitive preferences. Although supervised pair preference models (PairPM) can express general preferences, their implementation is highly ad hoc and cannot guarantee a consistent preference probability for compared pairs. They also impose high computational costs, because comparing multiple responses requires a quadratic number of queries. In this paper, we introduce preference representation learning, an approach that embeds responses into a latent space to capture intricate preference structures efficiently, achieving linear query complexity. We further propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback. Experimental results show that our General Preference representation model (GPM) outperforms the BT reward model on the RewardBench benchmark by a margin of up to 5.6% and effectively models cyclic preferences, where any BT reward model behaves like a random guess. Furthermore, evaluations on downstream tasks such as AlpacaEval2.0 and MT-Bench, following language model post-training with GPO and our general preference model, reveal substantial performance improvements, with margins of up to 9.3%. These findings indicate that our method may enhance the alignment of foundation models with nuanced human values. The code is available at https://github.com/general-preference/general-preference-model.
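The contrast between scalar BT rewards and embedding-based preference scores can be illustrated with a small toy example. The sketch below is not the authors' implementation: the 2-D embeddings, the skew-symmetric score in `preference_score`, and all function names are assumptions chosen for illustration, since the abstract does not specify how scores are computed from embeddings. It shows a rock-paper-scissors style cycle that the assumed score represents directly, while a grid search over scalar BT rewards finds no assignment that reproduces it.

```python
import itertools
import math

# Hypothetical 2-D embeddings (not from the paper): three responses placed
# 120 degrees apart on the unit circle, like rock-paper-scissors.
NAMES = ["rock", "paper", "scissors"]
EMBEDDINGS = {
    name: (math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
    for k, name in enumerate(NAMES)
}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def preference_score(v_i, v_j):
    """Assumed skew-symmetric score: score(i, j) == -score(j, i)."""
    return v_i[1] * v_j[0] - v_i[0] * v_j[1]


def preference_probability(v_i, v_j):
    """Probability that response i is preferred over response j."""
    return sigmoid(preference_score(v_i, v_j))


def all_pairwise(embeddings):
    """Score every ordered pair from K precomputed embeddings.

    Each response is embedded once (K model queries in a real system);
    the K*(K-1) pairwise scores are then cheap arithmetic.
    """
    return {
        (a, b): preference_probability(embeddings[a], embeddings[b])
        for a, b in itertools.permutations(embeddings, 2)
    }


def bt_probability(r_i, r_j):
    """Bradley-Terry preference probability from two scalar rewards."""
    return sigmoid(r_i - r_j)


def bt_can_fit_cycle(grid):
    """Search scalar rewards on a grid for a strict 3-cycle under BT."""
    for r_rock, r_paper, r_scissors in itertools.product(grid, repeat=3):
        if (bt_probability(r_paper, r_rock) > 0.5
                and bt_probability(r_scissors, r_paper) > 0.5
                and bt_probability(r_rock, r_scissors) > 0.5):
            return True
    return False


if __name__ == "__main__":
    for pair, p in all_pairwise(EMBEDDINGS).items():
        print(pair, round(p, 3))
    print("BT fits cycle:", bt_can_fit_cycle([x / 2 for x in range(-6, 7)]))
```

Because the assumed score is skew-symmetric, the probabilities for a pair and its reverse sum to one, which is one way to obtain the consistency that the abstract says PairPM cannot guarantee. The BT search fails for any grid, since a strict cycle would require three scalar rewards that are each larger than the next.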
