MARS: Unleashing the Power of Variance Reduction for Training Large Models

November 15, 2024
Authors: Huizhuo Yuan, Yifeng Liu, Shuang Wu, Xun Zhou, Quanquan Gu
cs.AI

Abstract

Training deep neural networks--and more recently, large models--demands efficient and scalable optimizers. Adaptive gradient algorithms like Adam, AdamW, and their variants have been central to this task. Despite the development of numerous variance reduction algorithms in the past decade aimed at accelerating stochastic optimization in both convex and nonconvex settings, variance reduction has not found widespread success in training deep neural networks or large language models. Consequently, it has remained a less favored approach in modern AI. In this paper, to unleash the power of variance reduction for efficient training of large models, we propose a unified optimization framework, MARS (Make vAriance Reduction Shine), which reconciles preconditioned gradient methods with variance reduction via a scaled stochastic recursive momentum technique. Within our framework, we introduce three instances of MARS that leverage preconditioned gradient updates based on AdamW, Lion, and Shampoo, respectively. We also draw a connection between our algorithms and existing optimizers. Experimental results on training GPT-2 models indicate that MARS consistently outperforms AdamW by a large margin.
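The abstract describes MARS only at a high level. As a rough, non-authoritative sketch of how a scaled stochastic recursive momentum correction could be combined with an AdamW-style preconditioned update, the NumPy snippet below implements one plausible optimizer step. The correction coefficient `gamma`, the unit-norm clipping, the hyperparameter values, and all function and variable names (`mars_adamw_step`, `grad_fn`, `state`) are illustrative assumptions, not the paper's exact MARS-AdamW algorithm.

```python
import numpy as np

def mars_adamw_step(x, x_prev, grad_fn, state, lr=3e-3, beta1=0.95, beta2=0.99,
                    gamma=0.025, weight_decay=0.1, eps=1e-8):
    """One illustrative MARS-style update with AdamW-type preconditioning.

    `grad_fn(params)` is assumed to return the stochastic gradient evaluated on
    the *same* minibatch at both `x` and `x_prev`, which is what a recursive
    momentum (variance-reduction) correction requires.
    """
    g_t = grad_fn(x)          # gradient at the current iterate
    g_prev = grad_fn(x_prev)  # gradient at the previous iterate, same minibatch

    # Scaled stochastic recursive momentum: correct the raw gradient with a
    # scaled gradient difference, then clip the corrected gradient to unit norm.
    c_t = g_t + gamma * (beta1 / (1.0 - beta1)) * (g_t - g_prev)
    norm = np.linalg.norm(c_t)
    if norm > 1.0:
        c_t = c_t / norm

    # Exponential moving averages of the corrected gradient, as in Adam/AdamW.
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * c_t
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * c_t ** 2
    state["t"] += 1
    m_hat = state["m"] / (1.0 - beta1 ** state["t"])
    v_hat = state["v"] / (1.0 - beta2 ** state["t"])

    # AdamW-style preconditioned step with decoupled weight decay.
    x_new = x - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * x)
    return x_new, x  # new iterate, plus the iterate to pass as x_prev next step
```

In this sketch the caller initializes `state = {"m": np.zeros_like(x), "v": np.zeros_like(x), "t": 0}` and rebuilds `grad_fn` for each minibatch so that both gradient evaluations share the same sample; that shared sample is what lets the gradient-difference term act as variance reduction rather than extra noise. The Lion- and Shampoo-based instances mentioned in the abstract would swap in different preconditioned updates for the final step.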
