Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback
October 24, 2024
Authors: Lester James V. Miranda, Yizhong Wang, Yanai Elazar, Sachin Kumar, Valentina Pyatkin, Faeze Brahman, Noah A. Smith, Hannaneh Hajishirzi, Pradeep Dasigi
cs.AI
Abstract
Learning from human feedback has enabled the alignment of language models
(LMs) with human preferences. However, directly collecting human preferences
can be expensive, time-consuming, and subject to high variance. An appealing
alternative is to distill preferences from LMs as a source of synthetic
annotations as they are more consistent, cheaper, and scale better than human
annotation; however, they are also prone to biases and errors. In this work, we
introduce a routing framework that combines inputs from humans and LMs to
achieve better annotation quality, while reducing the total cost of human
annotation. The crux of our approach is to identify preference instances that
will benefit from human annotations. We formulate this as an optimization
problem: given a preference dataset and an evaluation metric, we train a
performance prediction model to predict a reward model's performance on an
arbitrary combination of human and LM annotations and employ a routing strategy
that selects a combination that maximizes predicted performance. We train the
performance prediction model on MultiPref, a new preference dataset with 10K
instances paired with human and LM labels. We show that the selected hybrid
mixture of LM and direct human preferences using our routing framework achieves
better reward model performance compared to using either one exclusively. We
simulate selective human preference collection on three other datasets and show
that our method generalizes well to all three. We analyze features from the
routing model to identify characteristics of instances that can benefit from
human feedback, e.g., prompts with a moderate safety concern or moderate intent
complexity. We release the dataset, annotation platform, and source code used
in this study to foster more efficient and accurate preference collection in
the future.
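To make the routing formulation concrete, the sketch below illustrates one way the pieces could fit together: a performance predictor trained on encoded human/LM annotation mixes, and a greedy, budget-constrained routing step that sends to human annotators the instances with the largest predicted gain. This is only a rough sketch under assumptions not stated in the abstract (scikit-learn's GradientBoostingRegressor as the predictor, a hypothetical encode_mix featurizer, and a single-swap greedy search); the authors' actual routing strategy and features may differ.

```python
# A minimal sketch of the routing idea described above -- not the authors'
# released implementation. The featurizer, the gradient-boosted regressor as
# the performance predictor, and the greedy budget-constrained search are all
# illustrative assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor


def encode_mix(instance_features, mix):
    """Hypothetical featurizer: summarize which instances receive human labels
    (mix[i] == 1) by aggregating their per-instance features."""
    return (instance_features * mix[:, None]).sum(axis=0, keepdims=True)


def train_performance_predictor(mix_encodings, observed_rm_scores):
    """Fit a regressor mapping an encoded human/LM annotation mix to the
    reward-model performance observed when training on that mix."""
    predictor = GradientBoostingRegressor()
    predictor.fit(mix_encodings, observed_rm_scores)
    return predictor


def route_instances(instance_features, predictor, human_budget):
    """Greedy routing: estimate, for each instance, the predicted gain from
    swapping its LM label for a human label, then route the top instances
    (up to the budget) to human annotators."""
    n = len(instance_features)
    all_lm = np.zeros(n)  # 0 = keep the LM label, 1 = request a human label

    base_score = predictor.predict(encode_mix(instance_features, all_lm)).item()
    gains = []
    for i in range(n):
        candidate = all_lm.copy()
        candidate[i] = 1
        score = predictor.predict(encode_mix(instance_features, candidate)).item()
        gains.append(score - base_score)

    routed = np.argsort(gains)[::-1][:human_budget]
    mix = all_lm.copy()
    mix[routed] = 1
    return mix  # 1 = send to human annotators, 0 = use the LM annotation
```

In a usage scenario consistent with the abstract, the predictor would be fit on observed (annotation mix, reward-model performance) pairs from MultiPref-style data, and route_instances would then be called with a fixed human-annotation budget to decide which prompts go to humans and which keep their LM labels.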