The Differences Between Direct Alignment Algorithms are a Blur

Abstract

Direct Alignment Algorithms (DAAs) simplify language model alignment by replacing reinforcement learning (RL) and reward modeling (RM) in Reinforcement Learning from Human Feedback (RLHF) with direct policy optimization. DAAs can be classified by their ranking losses (pairwise vs. pointwise), by the rewards used in those losses (e.g., likelihood ratios of policy and reference policy, or odds ratios), or by whether a Supervised Fine-Tuning (SFT) phase is required (two-stage vs. one-stage). We first show that one-stage methods underperform two-stage methods. To address this, we incorporate an explicit SFT phase and introduce a beta parameter, which controls the strength of preference optimization, into the one-stage methods ORPO and ASFT. These modifications improve their performance on Alpaca Eval 2 by +3.46 (ORPO) and +8.27 (ASFT), matching two-stage methods like DPO. Further analysis reveals that the key factor is whether the approach uses pairwise or pointwise objectives, rather than the specific implicit reward or loss function. These results highlight the importance of careful evaluation to avoid premature claims of performance gains or overall superiority in alignment algorithms.
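The two axes named in the abstract, pairwise vs. pointwise losses and likelihood-ratio vs. odds-ratio rewards, can be illustrated with a minimal sketch. The function names, the scalar log-probability inputs, and the way beta scales each reward are assumptions made for illustration only; this is not the authors' implementation.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def likelihood_ratio_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    """Implicit reward beta * log(pi(y|x) / pi_ref(y|x)), as in DPO-style methods."""
    return beta * (logp_policy - logp_ref)


def odds_ratio_reward(logp_policy: float, beta: float) -> float:
    """Implicit reward beta * log(p / (1 - p)), with p = exp(logp_policy).

    Assumes logp_policy < 0 (e.g. a length-normalized sequence log-probability),
    so that 0 < p < 1. No reference policy is involved.
    """
    p = math.exp(logp_policy)
    return beta * (logp_policy - math.log1p(-p))


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise objective: only the reward margin between the two responses matters."""
    return -log_sigmoid(reward_chosen - reward_rejected)


def pointwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pointwise objective: each response is scored on its own, with no margin term."""
    return -log_sigmoid(reward_chosen) - log_sigmoid(-reward_rejected)
```

Any reward function can be paired with either loss, which is what makes the classification two-dimensional: for example, `pairwise_loss(likelihood_ratio_reward(...), likelihood_ratio_reward(...))` has the shape of a DPO-style objective, while `pointwise_loss(odds_ratio_reward(...), odds_ratio_reward(...))` has the shape of a pointwise odds-ratio objective. In this sketch a larger beta scales the rewards and therefore sharpens the preference term.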
