Weighted-Reward Preference Optimization for Implicit Model Fusion
December 4, 2024
Authors: Ziyi Yang, Fanqi Wan, Longguang Zhong, Tianyuan Shi, Xiaojun Quan
cs.AI
Abstract
While fusing heterogeneous open-source LLMs with varying architectures and
sizes can potentially integrate the strengths of different models, existing
fusion methods face significant challenges, such as vocabulary alignment and
merging distribution matrices. These procedures are not only complex but also
prone to introducing noise and errors. In this paper, we propose an implicit
fusion method, Weighted-Reward Preference Optimization (WRPO), which leverages
preference optimization between the source LLMs and the target LLM to transfer
their capabilities effectively. WRPO eliminates the need for vocabulary
alignment and matrix fusion and can be efficiently scaled to accommodate
various LLMs. To address distributional deviations between the source and
target LLMs, WRPO introduces a progressive adaptation strategy that gradually
shifts reliance on preferred examples from the target LLM to the source LLMs.
Extensive experiments on the MT-Bench, AlpacaEval-2, and Arena-Hard benchmarks
demonstrate that WRPO consistently outperforms existing knowledge fusion
methods and various fine-tuning baselines. When applied to LLaMA3-8B-Instruct
as the target model, WRPO achieves a length-controlled win rate of 55.9%
against GPT-4-Preview-1106 on AlpacaEval-2 and a win rate of 46.2% against
GPT-4-0314 on Arena-Hard. Our code is available at
https://github.com/SLIT-AI/WRPO.
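The abstract only sketches the objective at a high level. Below is a minimal, illustrative sketch of how a weighted-reward preference loss with progressive adaptation might look, assuming a DPO-style implicit reward (beta times the policy-to-reference log-probability ratio) and a linear schedule for the blending weight alpha. The function names, argument layout, and the linear schedule are assumptions made for illustration; the paper's actual formulation may differ.

```python
import torch
import torch.nn.functional as F


def wrpo_loss(logp_policy_src_win, logp_ref_src_win,
              logp_policy_tgt_win, logp_ref_tgt_win,
              logp_policy_lose, logp_ref_lose,
              alpha, beta=0.1):
    """Weighted-reward preference loss (illustrative sketch, not the paper's exact objective).

    Each argument is the summed log-probability of a response under either the
    trainable policy or the frozen reference model:
      *_src_win: preferred response sampled from a source LLM
      *_tgt_win: preferred response sampled from the target LLM
      *_lose:    dispreferred response
    alpha blends the implicit rewards of the two preferred responses.
    """
    # DPO-style implicit rewards: beta * log(pi_theta(y|x) / pi_ref(y|x))
    r_src_win = beta * (logp_policy_src_win - logp_ref_src_win)
    r_tgt_win = beta * (logp_policy_tgt_win - logp_ref_tgt_win)
    r_lose = beta * (logp_policy_lose - logp_ref_lose)

    # Weighted preferred reward: as alpha grows, reliance shifts from the
    # target LLM's preferred response toward the source LLM's.
    r_win = alpha * r_src_win + (1.0 - alpha) * r_tgt_win
    return -F.logsigmoid(r_win - r_lose).mean()


def alpha_schedule(step, total_steps, alpha_max=1.0):
    """Progressive adaptation (assumed linear): alpha rises from 0 to alpha_max."""
    return alpha_max * min(step / max(total_steps, 1), 1.0)


if __name__ == "__main__":
    # Toy usage with scalar log-probabilities.
    lp = lambda v: torch.tensor(v)
    loss = wrpo_loss(lp(-10.0), lp(-12.0), lp(-11.0), lp(-12.5),
                     lp(-15.0), lp(-13.0),
                     alpha=alpha_schedule(step=100, total_steps=1000))
    print(loss.item())
```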