A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement
October 17, 2024
Authors: Hui Yuan, Yifan Zeng, Yue Wu, Huazheng Wang, Mengdi Wang, Liu Leqi
cs.AI
Abstract
Reinforcement Learning from Human Feedback (RLHF) has become the predominant
approach for language model (LM) alignment. At its core, RLHF uses a
margin-based loss for preference optimization, specifying ideal LM behavior
only by the difference between preferred and dispreferred responses. In this
paper, we identify a common pitfall of margin-based methods -- the
under-specification of ideal LM behavior on preferred and dispreferred
responses individually, which leads to two unintended consequences as the
margin increases: (1) The probability of dispreferred (e.g., unsafe) responses
may increase, resulting in potential safety alignment failures. (2) The
probability of preferred responses may decrease, even when those responses are
ideal. We demystify the reasons behind these problematic behaviors:
margin-based losses couple the change in the preferred probability to the
gradient of the dispreferred one, and vice versa, often preventing the
preferred probability from increasing while the dispreferred one decreases, and
thus causing a synchronized increase or decrease in both probabilities. We term
this effect, inherent in margin-based objectives, gradient entanglement.
Formally, we derive conditions for general margin-based alignment objectives
under which gradient entanglement becomes concerning: the inner product of the
gradients of preferred and dispreferred log-probabilities is large relative to
the individual gradient norms. We theoretically investigate why such inner
products can be large when aligning language models and empirically validate
our findings. Empirical implications of our framework extend to explaining
important differences in the training dynamics of various preference
optimization algorithms, and suggesting potential algorithm designs to mitigate
the under-specification issue of margin-based methods and thereby improve
language model alignment.
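
To make the stated condition concrete, here is a minimal numerical sketch, not the authors' code: a toy single-token softmax policy optimized with a DPO-style margin loss (no reference model term). To first order, one gradient step on such a loss changes the two log-probabilities roughly as d log pi_w proportional to ||g_w||^2 - <g_w, g_l> and d log pi_l proportional to <g_w, g_l> - ||g_l||^2, where g_w and g_l are the gradients of the preferred and dispreferred log-probabilities, so both changes are driven by the same inner product. The vocabulary size, response indices, beta, and learning rate below are arbitrary illustrative choices; in this tiny example the inner product is typically small, whereas for real language models the two responses share a prompt and many tokens, which is one reason it can become large.

```python
# Minimal sketch of gradient entanglement under a margin-based loss.
# Assumptions (illustrative, not from the paper's code): a single-token
# softmax policy over a toy vocabulary and a DPO-style loss
# -log(sigmoid(beta * margin)) with no reference model.
import torch

torch.manual_seed(0)

V = 8                                        # toy vocabulary size (assumption)
theta = torch.randn(V, requires_grad=True)   # policy logits as parameters
y_w, y_l = 2, 5                              # preferred / dispreferred tokens
beta, lr = 1.0, 0.1                          # illustrative hyperparameters

def logp(params, y):
    # log pi(y) under the single-token softmax policy
    return torch.log_softmax(params, dim=0)[y]

# Gradients of the preferred and dispreferred log-probabilities,
# the quantities whose inner product the condition refers to.
g_w = torch.autograd.grad(logp(theta, y_w), theta)[0]
g_l = torch.autograd.grad(logp(theta, y_l), theta)[0]

print("<g_w, g_l> =", torch.dot(g_w, g_l).item())
print("||g_w||^2  =", torch.dot(g_w, g_w).item())
print("||g_l||^2  =", torch.dot(g_l, g_l).item())

# Margin-based objective: only the log-prob difference is specified.
margin = logp(theta, y_w) - logp(theta, y_l)
loss = -torch.nn.functional.logsigmoid(beta * margin)
grad_loss = torch.autograd.grad(loss, theta)[0]

# One gradient-descent step, then measure how both log-probs actually moved.
with torch.no_grad():
    theta_new = theta - lr * grad_loss
    print("d log pi_w =", (logp(theta_new, y_w) - logp(theta, y_w)).item())
    print("d log pi_l =", (logp(theta_new, y_l) - logp(theta, y_l)).item())
```

Comparing the printed inner product with the two squared gradient norms indicates, via the first-order relations above, whether the preferred log-probability should be expected to fall or the dispreferred one to rise after the step, which is the synchronized behavior the abstract describes.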