MARS: Unleashing the Power of Variance Reduction for Training Large Models
November 15, 2024
Authors: Huizhuo Yuan, Yifeng Liu, Shuang Wu, Xun Zhou, Quanquan Gu
cs.AI
Abstract
Training deep neural networks--and more recently, large models--demands
efficient and scalable optimizers. Adaptive gradient algorithms like Adam,
AdamW, and their variants have been central to this task. Despite the
development of numerous variance reduction algorithms in the past decade aimed
at accelerating stochastic optimization in both convex and nonconvex settings,
variance reduction has not found widespread success in training deep neural
networks or large language models. Consequently, it has remained a less favored
approach in modern AI. In this paper, to unleash the power of variance
reduction for efficient training of large models, we propose a unified
optimization framework, MARS (Make vAriance Reduction Shine), which reconciles
preconditioned gradient methods with variance reduction via a scaled stochastic
recursive momentum technique. Within our framework, we introduce three
instances of MARS that leverage preconditioned gradient updates based on AdamW,
Lion, and Shampoo, respectively. We also draw a connection between our
algorithms and existing optimizers. Experimental results on training GPT-2
models indicate that MARS consistently outperforms AdamW by a large margin.
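
To make the "scaled stochastic recursive momentum" idea above concrete, the sketch below combines a recursive gradient correction with an AdamW-flavored preconditioned update: the current stochastic gradient is corrected by a scaled difference of consecutive stochastic gradients before the usual first- and second-moment accumulation. The function name, hyperparameter defaults, and unit-norm clipping here are assumptions for illustration, not the paper's exact algorithm or pseudocode.

```python
import numpy as np

def mars_adamw_step(x, g, g_prev, m, v, t, lr=3e-3, beta1=0.95, beta2=0.99,
                    gamma=0.025, eps=1e-8, weight_decay=0.1):
    """One illustrative MARS-style, AdamW-flavored update step (a sketch).

    g      : stochastic gradient at the current iterate x
    g_prev : stochastic gradient at the previous iterate, ideally evaluated
             on the same mini-batch as g, so that (g - g_prev) acts as a
             variance-reduction correction rather than plain momentum
    """
    # Scaled stochastic recursive momentum: augment the current gradient with
    # a scaled difference of consecutive stochastic gradients.
    c = g + gamma * (beta1 / (1.0 - beta1)) * (g - g_prev)
    # Clip the corrected gradient to unit norm to keep the correction stable
    # (an assumed stabilization step for this sketch).
    c_norm = np.linalg.norm(c)
    if c_norm > 1.0:
        c = c / c_norm
    # AdamW-style preconditioning applied to the corrected gradient.
    m = beta1 * m + (1.0 - beta1) * c
    v = beta2 * v + (1.0 - beta2) * c ** 2
    m_hat = m / (1.0 - beta1 ** t)  # bias correction, t >= 1
    v_hat = v / (1.0 - beta2 ** t)
    # Decoupled weight decay, as in AdamW.
    x = x - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * x)
    return x, m, v
```

In this framing, swapping the final preconditioned update (the AdamW-style step above) for a Lion- or Shampoo-style update would give the other two instances the abstract mentions, while the gradient-correction step stays the same.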