

SALSA: Soup-based Alignment Learning for Stronger Adaptation in RLHF

November 4, 2024
Authors: Atoosa Chegini, Hamid Kazemi, Iman Mirzadeh, Dong Yin, Maxwell Horton, Moin Nabi, Mehrdad Farajtabar, Keivan Alizadeh
cs.AI

Abstract

In Large Language Model (LLM) development, Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning models with human values and preferences. RLHF traditionally relies on the Kullback-Leibler (KL) divergence between the current policy and a frozen initial policy as a reference, which is added as a penalty in policy optimization algorithms such as Proximal Policy Optimization (PPO). While this constraint prevents models from deviating too far from the initial checkpoint, it limits exploration of the reward landscape, reducing the model's ability to discover higher-quality solutions. As a result, policy optimization is often trapped in a narrow region of the parameter space, leading to suboptimal alignment and performance. This paper presents SALSA (Soup-based Alignment Learning for Stronger Adaptation), a novel approach designed to overcome these limitations by creating a more flexible and better-located reference model through weight-space averaging of two independent supervised fine-tuned (SFT) models. This model soup allows for larger deviation in KL divergence and exploration of promising regions of the solution space without sacrificing stability. By leveraging this more robust reference model, SALSA fosters better exploration, achieving higher rewards and improving model robustness, out-of-distribution generalization, and performance. We validate the effectiveness of SALSA through extensive experiments on popular open models (Llama2-7B, Mistral-7B, and Gemma-2B) across various benchmarks (MT-Bench, Arena-Hard, UltraFeedback), where it consistently surpasses PPO by fostering deeper exploration and achieving superior alignment in LLMs.
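
The abstract describes two mechanisms: averaging the weights of two independently fine-tuned SFT models to form a "soup" reference, and plugging that reference into the usual KL-penalized RLHF reward. Below is a minimal PyTorch sketch of both, assuming checkpoints with identical architectures; the function names (soup_state_dict, kl_penalized_reward), the 0.5 mixing coefficient, and the toy linear models are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch, assuming PyTorch-style state dicts; names and the toy
# models below are illustrative, not taken from the SALSA codebase.
import torch
import torch.nn as nn


def soup_state_dict(sft_a: dict, sft_b: dict, alpha: float = 0.5) -> dict:
    """Weight-space average ("model soup") of two independently fine-tuned
    SFT checkpoints that share the same architecture."""
    return {k: alpha * sft_a[k] + (1.0 - alpha) * sft_b[k] for k in sft_a}


def kl_penalized_reward(reward: torch.Tensor,
                        logprob_policy: torch.Tensor,
                        logprob_ref: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """Standard RLHF reward shaping: subtract a KL penalty toward the
    reference model. SALSA's change is only which reference is used
    (the soup instead of the frozen initial SFT policy)."""
    kl_estimate = logprob_policy - logprob_ref  # per-token estimate of the KL term
    return reward - beta * kl_estimate


# Toy usage: build a soup reference from two same-shaped checkpoints.
torch.manual_seed(0)
sft_model_a, sft_model_b = nn.Linear(4, 4), nn.Linear(4, 4)
reference_model = nn.Linear(4, 4)
reference_model.load_state_dict(
    soup_state_dict(sft_model_a.state_dict(), sft_model_b.state_dict())
)
```

In this reading, the soup replaces the frozen SFT checkpoint as the anchor of the KL penalty, which is what the abstract credits with allowing larger deviations without losing stability.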

