Weighted-Reward Preference Optimization for Implicit Model Fusion

December 4, 2024
Authors: Ziyi Yang, Fanqi Wan, Longguang Zhong, Tianyuan Shi, Xiaojun Quan
cs.AI

Abstract

While fusing heterogeneous open-source LLMs with varying architectures and sizes can potentially integrate the strengths of different models, existing fusion methods face significant challenges, such as vocabulary alignment and merging distribution matrices. These procedures are not only complex but also prone to introducing noise and errors. In this paper, we propose an implicit fusion method, Weighted-Reward Preference Optimization (WRPO), which leverages preference optimization between the source LLMs and the target LLM to transfer their capabilities effectively. WRPO eliminates the need for vocabulary alignment and matrix fusion and can be efficiently scaled to accommodate various LLMs. To address distributional deviations between the source and target LLMs, WRPO introduces a progressive adaptation strategy that gradually shifts reliance on preferred examples from the target LLM to the source LLMs. Extensive experiments on the MT-Bench, AlpacaEval-2, and Arena-Hard benchmarks demonstrate that WRPO consistently outperforms existing knowledge fusion methods and various fine-tuning baselines. When applied to LLaMA3-8B-Instruct as the target model, WRPO achieves a length-controlled win rate of 55.9% against GPT-4-Preview-1106 on AlpacaEval-2 and a win rate of 46.2% against GPT-4-0314 on Arena-Hard. Our code is available at https://github.com/SLIT-AI/WRPO.
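To make the core idea concrete, below is a minimal, hypothetical sketch of what a weighted-reward preference objective with progressive adaptation might look like, assuming a DPO-style implicit reward (policy-to-reference log-probability ratio). The function names, variable names, and the linear alpha schedule are illustrative assumptions for exposition, not the paper's exact formulation; see the official repository for the actual implementation.

```python
import math

def weighted_reward_preference_loss(
    logp_theta_yw_src,  # policy log-prob of the source-LLM preferred response
    logp_ref_yw_src,    # reference log-prob of the source-LLM preferred response
    logp_theta_yw_tgt,  # policy log-prob of the target-LLM preferred response
    logp_ref_yw_tgt,    # reference log-prob of the target-LLM preferred response
    logp_theta_yl,      # policy log-prob of the dispreferred response
    logp_ref_yl,        # reference log-prob of the dispreferred response
    alpha,              # fusion weight, annealed from 0 toward 1 during training
    beta=0.1,           # DPO-style inverse temperature (assumed value)
):
    """Illustrative weighted-reward preference loss (hypothetical form).

    The preferred-side implicit reward blends the source-LLM and target-LLM
    preferred responses with weight alpha; the dispreferred side is a
    standard DPO-style log-ratio reward.
    """
    r_src = beta * (logp_theta_yw_src - logp_ref_yw_src)
    r_tgt = beta * (logp_theta_yw_tgt - logp_ref_yw_tgt)
    r_lose = beta * (logp_theta_yl - logp_ref_yl)

    # Weighted preferred reward minus dispreferred reward
    margin = alpha * r_src + (1.0 - alpha) * r_tgt - r_lose

    # Negative log-sigmoid of the preference margin (Bradley-Terry style)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def alpha_schedule(step, total_steps):
    """Progressive adaptation: alpha grows linearly from 0 to 1, gradually
    shifting the preferred signal from the target LLM to the source LLMs."""
    return min(1.0, step / max(1, total_steps))
```

The intent of the schedule is to let the target model first learn from preferences anchored in its own distribution, then progressively adapt toward the stronger source-LLM responses, mitigating the distributional gap the abstract describes.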
