Weighted-Reward Preference Optimization for Implicit Model Fusion

Abstract

While fusing heterogeneous open-source LLMs with varying architectures and sizes can potentially integrate the strengths of different models, existing fusion methods face significant challenges, such as vocabulary alignment and merging distribution matrices. These procedures are not only complex but also prone to introducing noise and errors. In this paper, we propose an implicit fusion method, Weighted-Reward Preference Optimization (WRPO), which leverages preference optimization between the source LLMs and the target LLM to transfer their capabilities effectively. WRPO eliminates the need for vocabulary alignment and matrix fusion and can be efficiently scaled to accommodate various LLMs. To address distributional deviations between the source and target LLMs, WRPO introduces a progressive adaptation strategy that gradually shifts reliance on preferred examples from the target LLM to the source LLMs. Extensive experiments on the MT-Bench, AlpacaEval-2, and Arena-Hard benchmarks demonstrate that WRPO consistently outperforms existing knowledge fusion methods and various fine-tuning baselines. When applied to LLaMA3-8B-Instruct as the target model, WRPO achieves a length-controlled win rate of 55.9% against GPT-4-Preview-1106 on AlpacaEval-2 and a win rate of 46.2% against GPT-4-0314 on Arena-Hard. Our code is available at https://github.com/SLIT-AI/WRPO.
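The abstract does not state the training objective, so the following is only a minimal sketch of what a weighted-reward preference loss with progressive adaptation might look like. It assumes a DPO-style implicit reward (a scaled log-probability ratio against a reference model), a convex combination of the source-preferred and target-preferred rewards controlled by a weight `alpha`, and a linear ramp for `alpha`. None of these choices, nor the names `implicit_reward`, `weighted_reward_loss`, and `alpha_schedule`, come from the text above; they are illustrative assumptions.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    """Assumed DPO-style implicit reward: beta * log(pi(y|x) / pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def alpha_schedule(step: int, total_steps: int, alpha_max: float) -> float:
    """Assumed linear ramp of the fusion weight from 0 to alpha_max."""
    if total_steps <= 0:
        return alpha_max
    progress = min(1.0, max(0.0, step / total_steps))
    return alpha_max * progress


def weighted_reward_loss(
    r_source_win: float,
    r_target_win: float,
    r_lose: float,
    alpha: float,
) -> float:
    """Preference loss on a weighted mix of two preferred responses.

    alpha = 0 relies only on the target model's preferred response;
    larger alpha shifts reliance toward the source models' response.
    """
    preferred = alpha * r_source_win + (1.0 - alpha) * r_target_win
    return -log_sigmoid(preferred - r_lose)


if __name__ == "__main__":
    beta = 0.1  # hypothetical temperature, not taken from the paper
    # Hypothetical sequence log-probabilities (policy, reference).
    r_ws = implicit_reward(-40.0, -45.0, beta)  # source-preferred response
    r_wt = implicit_reward(-30.0, -31.0, beta)  # target-preferred response
    r_l = implicit_reward(-35.0, -33.0, beta)   # dispreferred response
    for step in (0, 50, 100):
        alpha = alpha_schedule(step, total_steps=100, alpha_max=0.5)
        loss = weighted_reward_loss(r_ws, r_wt, r_l, alpha)
        print(f"step={step} alpha={alpha:.2f} loss={loss:.4f}")
```

With `alpha` at 0 the loss reduces to an ordinary pairwise preference loss on the target model's own preferred response, which is one way to read the "gradually shifts reliance" description: early updates stay close to the target model's distribution, and the source models' responses are weighted in as training proceeds.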
