
Rethinking Diverse Human Preference Learning through Principal Component Analysis

February 18, 2025
Authors: Feng Luo, Rui Yang, Hao Sun, Chunyuan Deng, Jiarui Yao, Jingyan Shen, Huan Zhang, Hanjie Chen
cs.AI

Abstract

Understanding human preferences is crucial for improving foundation models and building personalized AI systems. However, preferences are inherently diverse and complex, making it difficult for traditional reward models to capture their full range. While fine-grained preference data can help, collecting it is expensive and hard to scale. In this paper, we introduce Decomposed Reward Models (DRMs), a novel approach that extracts diverse human preferences from binary comparisons without requiring fine-grained annotations. Our key insight is to represent human preferences as vectors and analyze them using Principal Component Analysis (PCA). By constructing a dataset of embedding differences between preferred and rejected responses, DRMs identify orthogonal basis vectors that capture distinct aspects of preference. These decomposed rewards can be flexibly combined to align with different user needs, offering an interpretable and scalable alternative to traditional reward models. We demonstrate that DRMs effectively extract meaningful preference dimensions (e.g., helpfulness, safety, humor) and adapt to new users without additional training. Our results highlight DRMs as a powerful framework for personalized and interpretable LLM alignment.
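The abstract outlines the core pipeline: embed preferred and rejected responses, take their embedding difference, apply PCA to obtain orthogonal preference directions, and combine those directions per user. Below is a minimal sketch of that idea, assuming a hypothetical `embed(prompt, response)` feature extractor and user-specific `weights`; it illustrates the decomposition described above, not the authors' released implementation.

```python
# Minimal sketch of PCA over embedding differences, as described in the abstract.
# `embed` is a hypothetical helper mapping (prompt, response) to a 1-D feature vector.
import numpy as np
from sklearn.decomposition import PCA


def decompose_preferences(pairs, embed, n_components=10):
    """Extract orthogonal preference directions from binary comparisons.

    pairs: iterable of (prompt, chosen_response, rejected_response)
    embed: callable returning a 1-D feature vector for (prompt, response)
    """
    # Each sample is the embedding difference between the preferred
    # and the rejected response for the same prompt.
    diffs = np.stack([
        embed(prompt, chosen) - embed(prompt, rejected)
        for prompt, chosen, rejected in pairs
    ])

    # PCA yields orthogonal basis vectors; each component can be read as
    # one decomposed reward direction (e.g., helpfulness, safety, humor).
    pca = PCA(n_components=n_components)
    pca.fit(diffs)
    return pca.components_  # shape: (n_components, embedding_dim)


def combined_reward(features, components, weights):
    """Score a response with a user-specific mixture of decomposed rewards."""
    # Project the response features onto each preference direction,
    # then take a weighted combination tailored to a particular user.
    per_direction_scores = components @ features
    return float(weights @ per_direction_scores)
```

Adapting to a new user then reduces to choosing the mixture `weights` over the fixed components, which is the "no additional training" property the abstract claims.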
