If You Can't Use Them, Recycle Them: Optimizing Merging at Scale Mitigates Performance Tradeoffs
December 5, 2024
作者: Muhammad Khalifa, Yi-Chern Tan, Arash Ahmadian, Tom Hosking, Honglak Lee, Lu Wang, Ahmet Üstün, Tom Sherborne, Matthias Gallé
cs.AI
Abstract
Model merging has shown great promise at combining expert models, but the
benefit of merging is unclear when merging "generalist" models trained on
many tasks. We explore merging in the context of large (~100B) models by
recycling checkpoints that exhibit tradeoffs among different tasks.
Such checkpoints are often created in the process of developing a frontier
model, and many suboptimal ones are usually discarded. Given a pool of model
checkpoints obtained from different training runs (e.g., different stages,
objectives, hyperparameters, and data mixtures), which naturally show tradeoffs
across different language capabilities (e.g., instruction following vs. code
generation), we investigate whether merging can recycle such suboptimal models
into a Pareto-optimal one. Our optimization algorithm tunes the weight of each
checkpoint in a linear combination, resulting in a Pareto-optimal model that
outperforms both individual models and merge-based baselines. Further analysis
shows that good merges tend to include almost all checkpoints with
non-zero weights, indicating that even seemingly bad initial checkpoints can
contribute to good final merges.
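As a rough illustration of the linear-combination step the abstract describes, below is a minimal Python sketch assuming PyTorch-style state dicts. The function name merge_checkpoints, the weight normalization, and the commented-out usage are illustrative assumptions; the paper's actual optimization procedure for tuning the weights is not reproduced here.

```python
import torch

def merge_checkpoints(state_dicts, weights):
    """Linearly combine parameter tensors from several checkpoints.

    state_dicts: list of model state dicts sharing identical keys/shapes.
    weights: one scalar per checkpoint; normalized here to sum to 1
             (an assumption -- the abstract only says each weight is tuned).
    """
    total = sum(weights)
    weights = [w / total for w in weights]
    merged = {}
    for key in state_dicts[0]:
        # Weighted sum of the same parameter tensor across all checkpoints.
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

# Hypothetical usage: candidate weight vectors would come from an outer
# search loop scored on held-out task performance (not shown here).
# ckpts = [torch.load(p, map_location="cpu") for p in checkpoint_paths]
# merged_sd = merge_checkpoints(ckpts, weights=[0.1, 0.4, 0.2, 0.3])
# model.load_state_dict(merged_sd)
```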