Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback

Abstract

Learning from human feedback has enabled the alignment of language models (LMs) with human preferences. However, collecting human preferences directly is expensive and time-consuming, and the resulting labels can have high variance. An appealing alternative is to distill preferences from LMs as a source of synthetic annotations, which are more consistent, cheaper, and more scalable than human annotations; however, they are also prone to biases and errors. In this work, we introduce a routing framework that combines inputs from humans and LMs to achieve better annotation quality while reducing the total cost of human annotation. The crux of our approach is to identify the preference instances that will benefit from human annotation. We formulate this as an optimization problem: given a preference dataset and an evaluation metric, we train a performance prediction model to predict a reward model's performance on an arbitrary combination of human and LM annotations, and we employ a routing strategy that selects the combination that maximizes predicted performance. We train the performance prediction model on MultiPref, a new preference dataset of 10K instances, each paired with human and LM labels. We show that the hybrid mixture of LM and direct human preferences selected by our routing framework achieves better reward model performance than using either source exclusively. We simulate selective human preference collection on three other datasets and show that our method generalizes well to all three. We analyze features from the routing model to identify characteristics of instances that can benefit from human feedback, e.g., prompts with a moderate safety concern or moderate intent complexity. We release the dataset, annotation platform, and source code used in this study to foster more efficient and accurate preference collection in the future.
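
The optimization problem described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how candidate combinations are searched, so the random search below is an assumption, and `route_instances`, `toy_predictor`, and `TOY_GAINS` are hypothetical names and made-up numbers used only for illustration. The sketch treats the performance prediction model as a black-box function that scores a set of instances routed to humans (all remaining instances keep their LM labels) and returns the highest-scoring candidate set within a human-annotation budget.

```python
import random
from typing import Callable, FrozenSet, List

# Hypothetical per-instance effect on reward model performance when an
# instance receives a human label instead of an LM label (made-up values).
TOY_GAINS: List[float] = [0.03, -0.01, 0.02, 0.0, -0.02, 0.04, 0.01, -0.03]


def toy_predictor(human_ids: FrozenSet[int]) -> float:
    """Stand-in for a trained performance prediction model (assumption)."""
    return 0.60 + sum(TOY_GAINS[i] for i in human_ids)


def route_instances(
    n_instances: int,
    budget: int,
    predict_performance: Callable[[FrozenSet[int]], float],
    n_candidates: int = 200,
    seed: int = 0,
) -> FrozenSet[int]:
    """Pick the set of instances to route to humans via random search.

    Starts from the all-LM baseline (empty set) and keeps the sampled
    candidate with the highest predicted performance.
    """
    rng = random.Random(seed)
    max_size = min(budget, n_instances)
    best_subset: FrozenSet[int] = frozenset()
    best_score = predict_performance(best_subset)
    for _ in range(n_candidates):
        size = rng.randint(0, max_size)
        candidate = frozenset(rng.sample(range(n_instances), size))
        score = predict_performance(candidate)
        if score > best_score:
            best_subset, best_score = candidate, score
    return best_subset


budget = 3
chosen = route_instances(len(TOY_GAINS), budget, toy_predictor)
print(sorted(chosen), round(toy_predictor(chosen), 3))
```

Because the search starts from the all-LM baseline, the returned set never scores below that baseline under the predictor. In practice the predictor would be a learned model rather than an additive toy function, and a different search procedure could be swapped in without changing the interface.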
