Rethinking Diverse Human Preference Learning through Principal Component Analysis
February 18, 2025
Authors: Feng Luo, Rui Yang, Hao Sun, Chunyuan Deng, Jiarui Yao, Jingyan Shen, Huan Zhang, Hanjie Chen
cs.AI
Abstract
Understanding human preferences is crucial for improving foundation models
and building personalized AI systems. However, preferences are inherently
diverse and complex, making it difficult for traditional reward models to
capture their full range. While fine-grained preference data can help,
collecting it is expensive and hard to scale. In this paper, we introduce
Decomposed Reward Models (DRMs), a novel approach that extracts diverse human
preferences from binary comparisons without requiring fine-grained annotations.
Our key insight is to represent human preferences as vectors and analyze them
using Principal Component Analysis (PCA). By constructing a dataset of
embedding differences between preferred and rejected responses, DRMs identify
orthogonal basis vectors that capture distinct aspects of preference. These
decomposed rewards can be flexibly combined to align with different user needs,
offering an interpretable and scalable alternative to traditional reward
models. We demonstrate that DRMs effectively extract meaningful preference
dimensions (e.g., helpfulness, safety, humor) and adapt to new users without
additional training. Our results highlight DRMs as a powerful framework for
personalized and interpretable LLM alignment.
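
The abstract outlines the core pipeline: collect embedding differences between preferred and rejected responses, run PCA to obtain orthogonal preference directions, and linearly combine those directions into user-specific rewards. Below is a minimal NumPy sketch of that idea; the function names, the choice of embedding backbone, and the user weighting scheme are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the decomposed-reward idea described in the abstract:
# PCA over (chosen - rejected) embedding differences, with the principal
# directions used as decomposed reward heads. Names and data are placeholders.
import numpy as np

def decompose_preferences(chosen_emb: np.ndarray, rejected_emb: np.ndarray, k: int = 8) -> np.ndarray:
    """PCA over embedding differences.

    chosen_emb, rejected_emb: (n_pairs, d) embeddings from some frozen
    feature extractor (an assumption; the paper's backbone may differ).
    Returns k orthonormal direction vectors of shape (k, d).
    """
    diffs = chosen_emb - rejected_emb                    # preference vectors
    diffs = diffs - diffs.mean(axis=0, keepdims=True)    # center before PCA
    # SVD of the centered difference matrix yields the principal components.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k]                                        # (k, d) orthogonal basis

def decomposed_rewards(response_emb: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Score one response embedding along each decomposed preference direction."""
    return basis @ response_emb                          # (k,) reward vector

def combined_reward(response_emb: np.ndarray, basis: np.ndarray, weights: np.ndarray) -> float:
    """Linearly combine decomposed rewards with user-specific weights."""
    return float(weights @ decomposed_rewards(response_emb, basis))

# Example with random stand-in data: fit the basis, then score a new response
# under a hypothetical user's weighting of the preference directions.
rng = np.random.default_rng(0)
chosen, rejected = rng.normal(size=(1000, 64)), rng.normal(size=(1000, 64))
basis = decompose_preferences(chosen, rejected, k=4)
user_weights = np.array([0.6, 0.2, 0.1, 0.1])
print(combined_reward(rng.normal(size=64), basis, user_weights))
```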