What Matters for Model Merging at Scale?

October 4, 2024
Authors: Prateek Yadav, Tu Vu, Jonathan Lai, Alexandra Chronopoulou, Manaal Faruqui, Mohit Bansal, Tsendsuren Munkhdalai
cs.AI

Abstract

Model merging aims to combine multiple expert models into a more capable single model, offering benefits such as reduced storage and serving costs, improved generalization, and support for decentralized model development. Despite its promise, previous studies have primarily focused on merging a few small models. This leaves many unanswered questions about the effect of scaling model size and how it interplays with other key factors, such as base model quality and the number of expert models, to affect the merged model's performance. This work systematically evaluates the utility of model merging at scale, examining the impact of these different factors. We experiment with merging fully fine-tuned models using four popular merging methods (Averaging, Task Arithmetic, Dare, and TIES) across model sizes ranging from 1B to 64B parameters, merging up to 8 different expert models. We evaluate the merged models on both held-in tasks, i.e., the experts' training tasks, and zero-shot generalization to unseen held-out tasks. Our experiments provide several new insights about model merging at scale and the interplay between different factors. First, we find that merging is more effective when experts are created from strong base models, i.e., models with good zero-shot performance. Second, larger models facilitate easier merging. Third, merging consistently improves generalization capabilities; notably, when merging 8 large expert models, the merged models often generalize better than the corresponding multitask-trained models. Fourth, we can merge more expert models effectively when working with larger models. Fifth, different merging methods behave very similarly at larger scales. Overall, our findings shed light on some interesting properties of model merging while also highlighting some limitations. We hope that this study will serve as a reference point on large-scale merging for upcoming research.
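
The merging methods named in the abstract all operate directly on model parameters. As a rough illustration, here is a minimal sketch of the two simplest ones, Averaging and Task Arithmetic, assuming each model is available as a dict of named torch tensors; the function names and the scaling coefficient `lam` are illustrative choices, not details from the paper (Dare and TIES add sparsification and sign-resolution steps on top of this).

```python
from typing import Dict, List

import torch

# Each model is assumed to be a mapping from parameter name to tensor.
StateDict = Dict[str, torch.Tensor]


def merge_average(experts: List[StateDict]) -> StateDict:
    """Simple Averaging: element-wise mean of the expert parameters."""
    return {
        name: torch.mean(torch.stack([e[name] for e in experts]), dim=0)
        for name in experts[0]
    }


def merge_task_arithmetic(base: StateDict,
                          experts: List[StateDict],
                          lam: float = 0.3) -> StateDict:
    """Task Arithmetic: add the summed task vectors (expert - base)
    back onto the base model, scaled by a single coefficient `lam`."""
    merged = {}
    for name, base_param in base.items():
        task_vector_sum = sum(e[name] - base_param for e in experts)
        merged[name] = base_param + lam * task_vector_sum
    return merged
```

In practice the scaling coefficient is typically tuned on held-in validation data, and the same scaffolding extends to methods like Dare and TIES by transforming the task vectors before they are summed.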
