Aligning Large Language Models via Self-Steering Optimization
October 22, 2024
Authors: Hao Xiang, Bowen Yu, Hongyu Lin, Keming Lu, Yaojie Lu, Xianpei Han, Le Sun, Jingren Zhou, Junyang Lin
cs.AI
Abstract
Automated alignment develops alignment systems with minimal human
intervention. The key to automated alignment lies in providing learnable and
accurate preference signals for preference learning without human annotation.
In this paper, we introduce Self-Steering Optimization (SSO), an algorithm
that autonomously generates high-quality preference signals based on predefined
principles during iterative training, eliminating the need for manual
annotation. SSO maintains the accuracy of signals by ensuring a consistent
gap between chosen and rejected responses while keeping them both on-policy to
suit the current policy model's learning capacity. SSO can benefit the online
and offline training of the policy model, as well as enhance the training of
reward models. We validate the effectiveness of SSO with two foundation
models, Qwen2 and Llama3.1; the results indicate that it provides accurate,
on-policy preference signals throughout iterative training. Without any manual annotation
or external models, SSO leads to significant performance improvements across
six subjective and objective benchmarks. Moreover, the preference data generated
by SSO significantly enhances the performance of the reward model on
RewardBench. Our work presents a scalable approach to preference optimization,
paving the way for more efficient and effective automated alignment.