The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training
January 31, 2025
Authors: Fabian Schaipp, Alexander Hägele, Adrien Taylor, Umut Simsekli, Francis Bach
cs.AI
Abstract
We show that learning-rate schedules for large model training behave
surprisingly similarly to a performance bound from non-smooth convex optimization
theory. We provide a bound for the constant schedule with linear cooldown; in
particular, the practical benefit of cooldown is reflected in the bound due to
the absence of logarithmic terms. Further, we show that this surprisingly close
match between optimization theory and practice can be exploited for
learning-rate tuning: we achieve noticeable improvements for training 124M and
210M Llama-type models by (i) extending the schedule for continued training
with optimal learning-rate, and (ii) transferring the optimal learning-rate
across schedules.
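As a concrete illustration of the schedule discussed in the abstract, the sketch below implements a constant learning rate followed by a linear cooldown to zero (with an optional linear warmup). This is a minimal Python sketch, not the paper's code: the function name, the `cooldown_frac` parameter, and the default 20% cooldown fraction are assumptions for illustration.

```python
def constant_with_linear_cooldown(step, total_steps, base_lr,
                                  warmup_steps=0, cooldown_frac=0.2):
    """Constant learning rate followed by a linear cooldown to zero.

    Hypothetical helper: parameter names and the default cooldown
    fraction are illustrative, not taken from the paper.
    """
    cooldown_start = int(total_steps * (1.0 - cooldown_frac))
    if warmup_steps > 0 and step < warmup_steps:
        # optional linear warmup from 0 up to base_lr
        return base_lr * (step + 1) / warmup_steps
    if step < cooldown_start:
        # constant phase
        return base_lr
    # linear cooldown: decay from base_lr at cooldown_start to 0 at total_steps
    cooldown_len = max(total_steps - cooldown_start, 1)
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / cooldown_len
```

Under this shape, "extending the schedule for continued training" amounts to increasing `total_steps` (which pushes the cooldown phase later) while leaving the constant phase untouched, so the run is only re-annealed at the new horizon.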