Mediator: Memory-efficient LLM Merging with Less Parameter Conflicts and Uncertainty Based Routing
February 6, 2025
Authors: Kunfeng Lai, Zhenheng Tang, Xinglin Pan, Peijie Dong, Xiang Liu, Haolan Chen, Li Shen, Bo Li, Xiaowen Chu
cs.AI
Abstract
Model merging aggregates Large Language Models (LLMs) finetuned on different
tasks into a stronger one. However, parameter conflicts between models lead to
performance degradation in averaging. While model routing addresses this issue
by selecting individual models during inference, it imposes excessive storage
and compute costs, and fails to leverage the common knowledge from different
models. In this work, we observe that different layers exhibit varying levels
of parameter conflicts. Building on this insight, we average layers with
minimal parameter conflicts and use a novel task-level expert routing for
layers with significant conflicts. To further reduce storage costs, inspired by
task arithmetic sparsity, we decouple multiple fine-tuned experts into a dense
expert and several sparse experts. To handle out-of-distribution samples,
we select and merge appropriate experts based on the task uncertainty of the
input data. We conduct extensive experiments on both LLaMA and Qwen with
varying parameter scales, and evaluate on real-world reasoning tasks. Results
demonstrate that our method consistently achieves significant performance
improvements while requiring less system cost compared to existing methods.
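
The abstract only outlines the pipeline, so the sketch below illustrates one plausible reading of it in PyTorch-style Python: measure per-layer parameter conflicts among task vectors, average low-conflict layers, keep a dense average plus sparse per-expert residuals for high-conflict layers, and route by the input's task uncertainty. The specific conflict metric (sign disagreement), sparsification rule (top-k magnitude), uncertainty measure (softmax entropy of a task router), threshold values, and all function names here are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch of a Mediator-style merge-and-route pipeline.
# Not the authors' implementation; metrics and thresholds are assumed stand-ins.
import torch

def task_vectors(base, experts):
    # Task vector = fine-tuned weights minus base weights, per layer.
    return [{k: e[k] - base[k] for k in base} for e in experts]

def layer_conflict(tvs, layer):
    # Assumed conflict metric: fraction of coordinates where task vectors disagree in sign.
    signs = torch.stack([torch.sign(tv[layer]) for tv in tvs])
    agree = (signs == signs[0]).all(dim=0)
    return 1.0 - agree.float().mean().item()

def sparsify(delta, keep_ratio=0.05):
    # Assumed top-k magnitude rule: keep only the largest entries to cut storage.
    k = max(1, int(delta.numel() * keep_ratio))
    thresh = delta.abs().flatten().kthvalue(delta.numel() - k + 1).values
    return delta * (delta.abs() >= thresh)

def build_mediator(base, experts, conflict_threshold=0.4):
    tvs = task_vectors(base, experts)
    merged, routed = {}, {}
    for layer in base:
        dense = base[layer] + torch.stack([tv[layer] for tv in tvs]).mean(dim=0)
        if layer_conflict(tvs, layer) < conflict_threshold:
            # Low conflict: a single averaged layer is shared by all tasks.
            merged[layer] = dense
        else:
            # High conflict: store the dense average plus sparse per-expert residuals.
            merged[layer] = dense
            routed[layer] = [sparsify(base[layer] + tv[layer] - dense) for tv in tvs]
    return merged, routed

def route(router_logits, merged, routed):
    # Assumed uncertainty measure: entropy of the task router's softmax.
    probs = torch.softmax(router_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum()
    if entropy > 0.5 * torch.log(torch.tensor(float(probs.numel()))):
        weights = probs  # uncertain input: blend experts by task probability
    else:
        weights = torch.nn.functional.one_hot(probs.argmax(), probs.numel()).float()
    out = {k: v.clone() for k, v in merged.items()}
    for layer, residuals in routed.items():
        out[layer] = out[layer] + sum(w * r for w, r in zip(weights, residuals))
    return out
```

Under these assumptions, only the high-conflict layers carry per-task state, and that state is stored as sparse residuals around the dense average, which is where the memory savings over keeping full per-task models would come from.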