Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback
October 24, 2024
Authors: Lester James V. Miranda, Yizhong Wang, Yanai Elazar, Sachin Kumar, Valentina Pyatkin, Faeze Brahman, Noah A. Smith, Hannaneh Hajishirzi, Pradeep Dasigi
cs.AI
Abstract
Learning from human feedback has enabled the alignment of language models
(LMs) with human preferences. However, directly collecting human preferences
can be expensive, time-consuming, and subject to high variance. An appealing
alternative is to distill preferences from LMs as a source of synthetic
annotations as they are more consistent, cheaper, and scale better than human
annotation; however, they are also prone to biases and errors. In this work, we
introduce a routing framework that combines inputs from humans and LMs to
achieve better annotation quality, while reducing the total cost of human
annotation. The crux of our approach is to identify preference instances that
will benefit from human annotations. We formulate this as an optimization
problem: given a preference dataset and an evaluation metric, we train a
performance prediction model to predict a reward model's performance on an
arbitrary combination of human and LM annotations and employ a routing strategy
that selects a combination that maximizes predicted performance. We train the
performance prediction model on MultiPref, a new preference dataset with 10K
instances paired with human and LM labels. We show that the selected hybrid
mixture of LM and direct human preferences using our routing framework achieves
better reward model performance compared to using either one exclusively. We
simulate selective human preference collection on three other datasets and show
that our method generalizes well to all three. We analyze features from the
routing model to identify characteristics of instances that can benefit from
human feedback, e.g., prompts with a moderate safety concern or moderate intent
complexity. We release the dataset, annotation platform, and source code used
in this study to foster more efficient and accurate preference collection in
the future.
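As a rough illustration of the routing idea described in the abstract, the sketch below assumes a pool of preference instances represented by hand-crafted features (e.g., safety concern, intent complexity) and a per-instance "gain from human annotation" target. The feature set, the `GradientBoostingRegressor` choice, the budget, and the greedy top-k selection are all assumptions made for illustration; they stand in for, but are not, the paper's performance prediction model or routing strategy.

```python
# Minimal, hypothetical sketch (not the authors' code): a regressor predicts
# how much reward-model performance would improve if an instance were
# human-annotated, and a simple routing step sends the top-scoring instances
# to human annotators under a fixed budget, keeping LM labels for the rest.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Hypothetical per-instance features (e.g., safety concern, intent complexity).
n_train, n_pool, n_features = 2_000, 10_000, 6
X_train = rng.normal(size=(n_train, n_features))
# Placeholder target: observed performance gain when an instance was human-
# rather than LM-annotated (fabricated here purely for illustration).
y_train = 0.5 * X_train[:, 0] + rng.normal(scale=0.1, size=n_train)

# Train the performance prediction model on (features -> observed gain).
perf_model = GradientBoostingRegressor().fit(X_train, y_train)

# Routing strategy: predict gains for a new pool and route only the instances
# with the largest predicted benefit to humans; the rest keep LM annotations.
X_pool = rng.normal(size=(n_pool, n_features))
human_budget = 2_500
predicted_gain = perf_model.predict(X_pool)
route_to_human = np.zeros(n_pool, dtype=bool)
route_to_human[np.argsort(predicted_gain)[-human_budget:]] = True

print(f"routed to humans: {route_to_human.sum()} / {n_pool}")
```

In the paper, routing is framed as choosing the human/LM annotation combination that maximizes the predicted reward-model performance; the greedy top-k selection above is a simplified stand-in for that optimization under a fixed human-annotation budget.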