Weighted-Reward Preference Optimization for Implicit Model Fusion
December 4, 2024
Authors: Ziyi Yang, Fanqi Wan, Longguang Zhong, Tianyuan Shi, Xiaojun Quan
cs.AI
Abstract
While fusing heterogeneous open-source LLMs with varying architectures and
sizes can potentially integrate the strengths of different models, existing
fusion methods face significant challenges, such as vocabulary alignment and
merging distribution matrices. These procedures are not only complex but also
prone to introducing noise and errors. In this paper, we propose an implicit
fusion method, Weighted-Reward Preference Optimization (WRPO), which leverages
preference optimization between the source LLMs and the target LLM to transfer
their capabilities effectively. WRPO eliminates the need for vocabulary
alignment and matrix fusion and can be efficiently scaled to accommodate
various LLMs. To address distributional deviations between the source and
target LLMs, WRPO introduces a progressive adaptation strategy that gradually
shifts reliance on preferred examples from the target LLM to the source LLMs.
Extensive experiments on the MT-Bench, AlpacaEval-2, and Arena-Hard benchmarks
demonstrate that WRPO consistently outperforms existing knowledge fusion
methods and various fine-tuning baselines. When applied to LLaMA3-8B-Instruct
as the target model, WRPO achieves a length-controlled win rate of 55.9%
against GPT-4-Preview-1106 on AlpacaEval-2 and a win rate of 46.2% against
GPT-4-0314 on Arena-Hard. Our code is available at
https://github.com/SLIT-AI/WRPO.
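To make the abstract's description concrete, here is a minimal PyTorch sketch of a weighted-reward, DPO-style preference loss with progressive adaptation. It is an illustration inferred from the abstract, not the paper's exact implementation: the function name `wrpo_loss`, its argument names, and the linear schedule for the fusion weight `alpha` are assumptions, and sequence log-probabilities are assumed to be precomputed under the policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def wrpo_loss(policy_logps_src_win, ref_logps_src_win,
              policy_logps_tgt_win, ref_logps_tgt_win,
              policy_logps_lose, ref_logps_lose,
              alpha, beta=0.1):
    """Sketch of a weighted-reward preference loss (hypothetical names).

    *_src_win: log-probs of the preferred response drawn from a source LLM
    *_tgt_win: log-probs of the preferred response drawn from the target LLM
    *_lose:    log-probs of the dispreferred response
    alpha:     fusion weight, annealed from 0 toward 1 so that reliance on
               preferred examples shifts from the target LLM to the source LLMs
    beta:      inverse temperature of the DPO-style implicit reward
    """
    # Implicit rewards as scaled policy/reference log-ratios, as in DPO.
    r_src_win = beta * (policy_logps_src_win - ref_logps_src_win)
    r_tgt_win = beta * (policy_logps_tgt_win - ref_logps_tgt_win)
    r_lose = beta * (policy_logps_lose - ref_logps_lose)

    # Weighted reward for the chosen side: interpolate between the
    # source-preferred and target-preferred responses.
    r_win = alpha * r_src_win + (1.0 - alpha) * r_tgt_win

    # Standard Bradley-Terry preference loss on the reward margin.
    return -F.logsigmoid(r_win - r_lose).mean()

# Illustrative progressive-adaptation schedule (an assumption, not the
# paper's): linearly increase alpha over training.
def alpha_schedule(step, total_steps):
    return min(1.0, step / total_steps)
```

Because the source LLMs only contribute sampled responses, not logits or weights, this formulation needs no vocabulary alignment or matrix fusion, which is what the abstract means by "implicit" fusion.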