Weighted-Reward Preference Optimization for Implicit Model Fusion
December 4, 2024
Authors: Ziyi Yang, Fanqi Wan, Longguang Zhong, Tianyuan Shi, Xiaojun Quan
cs.AI
Abstract
While fusing heterogeneous open-source LLMs with varying architectures and
sizes can potentially integrate the strengths of different models, existing
fusion methods face significant challenges, such as vocabulary alignment and
merging distribution matrices. These procedures are not only complex but also
prone to introducing noise and errors. In this paper, we propose an implicit
fusion method, Weighted-Reward Preference Optimization (WRPO), which leverages
preference optimization between the source LLMs and the target LLM to transfer
their capabilities effectively. WRPO eliminates the need for vocabulary
alignment and matrix fusion and can be efficiently scaled to accommodate
various LLMs. To address distributional deviations between the source and
target LLMs, WRPO introduces a progressive adaptation strategy that gradually
shifts reliance on preferred examples from the target LLM to the source LLMs.
Extensive experiments on the MT-Bench, AlpacaEval-2, and Arena-Hard benchmarks
demonstrate that WRPO consistently outperforms existing knowledge fusion
methods and various fine-tuning baselines. When applied to LLaMA3-8B-Instruct
as the target model, WRPO achieves a length-controlled win rate of 55.9%
against GPT-4-Preview-1106 on AlpacaEval-2 and a win rate of 46.2% against
GPT-4-0314 on Arena-Hard. Our code is available at
https://github.com/SLIT-AI/WRPO.
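The abstract only sketches the objective at a high level. Below is a minimal, illustrative sketch of how a weighted-reward preference loss with progressive adaptation might look, assuming a DPO-style implicit reward (beta times the policy-to-reference log-probability ratio) and a linear schedule for the blending weight alpha. The function names, argument layout, and the linear schedule are assumptions made for illustration; the paper's actual formulation may differ.

```python
import torch
import torch.nn.functional as F


def wrpo_loss(logp_policy_src_win, logp_ref_src_win,
              logp_policy_tgt_win, logp_ref_tgt_win,
              logp_policy_lose, logp_ref_lose,
              alpha, beta=0.1):
    """Weighted-reward preference loss (illustrative sketch, not the paper's exact objective).

    Each argument is the summed log-probability of a response under either the
    trainable policy or the frozen reference model:
      *_src_win: preferred response sampled from a source LLM
      *_tgt_win: preferred response sampled from the target LLM
      *_lose:    dispreferred response
    alpha blends the implicit rewards of the two preferred responses.
    """
    # DPO-style implicit rewards: beta * log(pi_theta(y|x) / pi_ref(y|x))
    r_src_win = beta * (logp_policy_src_win - logp_ref_src_win)
    r_tgt_win = beta * (logp_policy_tgt_win - logp_ref_tgt_win)
    r_lose = beta * (logp_policy_lose - logp_ref_lose)

    # Weighted preferred reward: as alpha grows, reliance shifts from the
    # target LLM's preferred response toward the source LLM's.
    r_win = alpha * r_src_win + (1.0 - alpha) * r_tgt_win
    return -F.logsigmoid(r_win - r_lose).mean()


def alpha_schedule(step, total_steps, alpha_max=1.0):
    """Progressive adaptation (assumed linear): alpha rises from 0 to alpha_max."""
    return alpha_max * min(step / max(total_steps, 1), 1.0)


if __name__ == "__main__":
    # Toy usage with scalar log-probabilities.
    lp = lambda v: torch.tensor(v)
    loss = wrpo_loss(lp(-10.0), lp(-12.0), lp(-11.0), lp(-12.5),
                     lp(-15.0), lp(-13.0),
                     alpha=alpha_schedule(step=100, total_steps=1000))
    print(loss.item())
```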