SALSA: Soup-based Alignment Learning for Stronger Adaptation in RLHF

November 4, 2024
Authors: Atoosa Chegini, Hamid Kazemi, Iman Mirzadeh, Dong Yin, Maxwell Horton, Moin Nabi, Mehrdad Farajtabar, Keivan Alizadeh
cs.AI

Abstract

In Large Language Model (LLM) development, Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning models with human values and preferences. RLHF traditionally relies on the Kullback-Leibler (KL) divergence between the current policy and a frozen initial reference policy, added as a penalty in policy optimization algorithms such as Proximal Policy Optimization (PPO). While this constraint prevents models from deviating too far from the initial checkpoint, it limits exploration of the reward landscape, reducing the model's ability to discover higher-quality solutions. As a result, policy optimization is often trapped in a narrow region of the parameter space, leading to suboptimal alignment and performance. This paper presents SALSA (Soup-based Alignment Learning for Stronger Adaptation), a novel approach designed to overcome these limitations by creating a more flexible and better-positioned reference model through weight-space averaging of two independently supervised fine-tuned (SFT) models. This model soup allows for larger deviations in KL divergence and exploration of promising regions of the solution space without sacrificing stability. By leveraging this more robust reference model, SALSA fosters better exploration, achieving higher rewards and improving model robustness, out-of-distribution generalization, and performance. We validate the effectiveness of SALSA through extensive experiments on popular open models (Llama2-7B, Mistral-7B, and Gemma-2B) across various benchmarks (MT-Bench, Arena-Hard, UltraFeedback), where it consistently surpasses PPO by fostering deeper exploration and achieving superior alignment in LLMs.
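
The following is a minimal sketch (not the authors' code) of the two ingredients the abstract describes: building a "model soup" reference by averaging the weights of two independently fine-tuned SFT checkpoints, and subtracting a per-token KL penalty against that reference from the reward, as PPO-style RLHF does. The tiny nn.Linear stand-ins, tensor shapes, and the beta coefficient are illustrative assumptions rather than the paper's actual setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def average_state_dicts(sd_a, sd_b, alpha=0.5):
    """Weight-space average ("model soup") of two state dicts with identical keys/shapes."""
    return {k: alpha * sd_a[k] + (1.0 - alpha) * sd_b[k] for k in sd_a}

def kl_penalized_reward(policy_logits, ref_logits, reward, beta=0.05):
    """Subtract a per-token KL(policy || reference) penalty from the scalar reward.

    policy_logits, ref_logits: (batch, seq_len, vocab) logits for the sampled response.
    reward: (batch,) reward-model scores.
    """
    logp_pi = F.log_softmax(policy_logits, dim=-1)
    logp_ref = F.log_softmax(ref_logits, dim=-1)
    kl = (logp_pi.exp() * (logp_pi - logp_ref)).sum(-1)  # per-token KL, shape (batch, seq_len)
    return reward - beta * kl.sum(-1)                    # shape (batch,)

# Toy demonstration with small stand-in "models"; in practice these would be
# two independently fine-tuned SFT checkpoints of the same LLM architecture.
sft_a, sft_b = nn.Linear(16, 32), nn.Linear(16, 32)
reference = nn.Linear(16, 32)
reference.load_state_dict(average_state_dicts(sft_a.state_dict(), sft_b.state_dict()))

policy_logits = torch.randn(2, 8, 32)  # (batch, seq_len, vocab)
ref_logits = torch.randn(2, 8, 32)
reward = torch.tensor([0.7, 0.3])
print(kl_penalized_reward(policy_logits, ref_logits, reward))
```

In SALSA's framing, the only change relative to standard PPO-based RLHF is which reference produces ref_logits: the averaged soup of two SFT models rather than the single frozen initial checkpoint, which is what permits larger KL deviations without losing stability.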
