If You Can't Use Them, Recycle Them: Optimizing Merging at Scale Mitigates Performance Tradeoffs
December 5, 2024
Authors: Muhammad Khalifa, Yi-Chern Tan, Arash Ahmadian, Tom Hosking, Honglak Lee, Lu Wang, Ahmet Üstün, Tom Sherborne, Matthias Gallé
cs.AI
Abstract
Model merging has shown great promise at combining expert models, but the
benefit of merging is unclear when merging "generalist" models trained on
many tasks. We explore merging in the context of large (~100B) models, by
recycling checkpoints that exhibit tradeoffs among different tasks.
Such checkpoints are often created in the process of developing a frontier
model, and many suboptimal ones are usually discarded. Given a pool of model
checkpoints obtained from different training runs (e.g., different stages,
objectives, hyperparameters, and data mixtures), which naturally show tradeoffs
across different language capabilities (e.g., instruction following vs. code
generation), we investigate whether merging can recycle such suboptimal models
into a Pareto-optimal one. Our optimization algorithm tunes the weight of each
checkpoint in a linear combination, resulting in a Pareto-optimal model that
outperforms both individual models and merge-based baselines. Further analysis
shows that good merges tend to include almost all checkpoints with
non-zero weights, indicating that even seemingly bad initial checkpoints can
contribute to good final merges.
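
As a rough illustration of the "linear combination" of checkpoints mentioned in the abstract, the sketch below merges parameter dictionaries with one scalar weight per checkpoint. It is not the authors' implementation: the function name `merge_checkpoints`, the use of NumPy arrays in place of real model tensors, and the normalization of weights to sum to one are all assumptions made for this example. The abstract does not specify how the weights are searched, so that step is left out.

```python
import numpy as np


def merge_checkpoints(checkpoints, weights):
    """Weighted linear combination of checkpoints (hypothetical helper).

    checkpoints: list of dicts mapping parameter name -> np.ndarray,
                 all with identical keys and shapes.
    weights:     one scalar per checkpoint.
    """
    if len(checkpoints) != len(weights):
        raise ValueError("need exactly one weight per checkpoint")
    w = np.asarray(weights, dtype=float)
    if w.sum() == 0:
        raise ValueError("weights must not sum to zero")
    # Assumption: rescale weights to sum to 1 so the merge is a weighted average.
    w = w / w.sum()
    merged = {}
    for name in checkpoints[0]:
        # Sum the same parameter across checkpoints, scaled by each weight.
        merged[name] = sum(wi * ckpt[name] for wi, ckpt in zip(w, checkpoints))
    return merged


# Toy usage with two tiny "checkpoints" that share one parameter.
ckpts = [
    {"layer.weight": np.array([1.0, 2.0])},
    {"layer.weight": np.array([3.0, 4.0])},
]
merged = merge_checkpoints(ckpts, weights=[1.0, 3.0])
```

In a setup like the one the abstract describes, an outer optimizer would propose the `weights` vector, evaluate the merged model on the tasks of interest, and keep the weightings that best balance the tradeoffs.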