No More Adam: Learning Rate Scaling at Initialization is All You Need

Abstract

In this work, we question the necessity of adaptive gradient methods for training deep neural networks. SGD-SaI is a simple yet effective enhancement to stochastic gradient descent with momentum (SGDM). SGD-SaI performs learning rate Scaling at Initialization (SaI) for distinct parameter groups, guided by their respective gradient signal-to-noise ratios (g-SNR). By adjusting learning rates without relying on adaptive second-moment estimates, SGD-SaI helps prevent training imbalances from the very first iteration and cuts the optimizer's memory usage by half compared to AdamW. Despite its simplicity and efficiency, SGD-SaI consistently matches or outperforms AdamW across a variety of Transformer-based training tasks, effectively overcoming a long-standing challenge of using SGD to train Transformers. SGD-SaI excels in ImageNet-1K classification with Vision Transformers (ViT) and in GPT-2 pretraining for large language models (LLMs, decoder-only Transformers), demonstrating robustness to hyperparameter variations and practicality for diverse applications. We further test its robustness on tasks such as LoRA fine-tuning for LLMs and diffusion models, where it consistently outperforms state-of-the-art optimizers. From a memory efficiency perspective, SGD-SaI achieves substantial savings in optimizer state, reducing memory usage by 5.93 GB for GPT-2 (1.5B parameters) and 25.15 GB for Llama2-7B compared to AdamW in full-precision training settings.
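As a rough illustration of the idea in the abstract, the pure-Python sketch below computes one learning-rate scale per parameter group from the first gradients it sees, freezes those scales, and then runs ordinary SGD with momentum. Everything specific here is an assumption made for the example, not the authors' implementation: the g-SNR formula (L2 norm of a group's gradient divided by the standard deviation of its entries), the class name `SGDSaISketch`, the group names, and the hyperparameter values are all hypothetical, since the abstract does not specify them.

```python
import math
from statistics import pstdev


def g_snr(grad, eps=1e-8):
    """Hypothetical g-SNR: L2 norm of the gradient over the std of its entries.

    This formula is an assumption for illustration; the abstract does not define it.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    return norm / (pstdev(grad) + eps)


class SGDSaISketch:
    """SGD with momentum plus a per-group learning-rate scale fixed at the first step."""

    def __init__(self, params, lr=0.01, momentum=0.9):
        self.params = params      # dict: group name -> list of floats (updated in place)
        self.lr = lr
        self.momentum = momentum
        self.scale = None         # per-group scale, computed once and then frozen
        # One momentum buffer per parameter is the only optimizer state kept.
        # AdamW would additionally keep a second-moment buffer of the same size.
        self.buf = {name: [0.0] * len(p) for name, p in params.items()}

    def step(self, grads):
        if self.scale is None:
            # Scaling at initialization: derived only from the first gradients.
            self.scale = {name: g_snr(g) for name, g in grads.items()}
        for name, p in self.params.items():
            group_lr = self.lr * self.scale[name]
            buf = self.buf[name]
            for i, g in enumerate(grads[name]):
                buf[i] = self.momentum * buf[i] + g   # momentum accumulation
                p[i] -= group_lr * buf[i]             # scaled SGD update


# Usage: two hypothetical parameter groups with different gradient statistics.
params = {"attn": [3.0, 4.0], "mlp": [1.0, -1.0]}
opt = SGDSaISketch(params, lr=0.01)
opt.step({"attn": [3.0, 4.0], "mlp": [1.0, -1.0]})
```

Because the scales are computed once and never updated, the only per-parameter state is the momentum buffer, which is consistent with the abstract's claim that optimizer memory is halved relative to AdamW.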

