No More Adam: Learning Rate Scaling at Initialization is All You Need

Abstract

In this work, we question the necessity of adaptive gradient methods for training deep neural networks. SGD-SaI is a simple yet effective enhancement to stochastic gradient descent with momentum (SGDM). SGD-SaI performs learning rate Scaling at Initialization (SaI) for distinct parameter groups, guided by their respective gradient signal-to-noise ratios (g-SNR). By adjusting learning rates without relying on adaptive second-order momentum, SGD-SaI helps prevent training imbalances from the very first iteration and cuts the optimizer's memory usage by half compared to AdamW. Despite its simplicity and efficiency, SGD-SaI consistently matches or outperforms AdamW on a variety of Transformer-based tasks, effectively overcoming the long-standing challenge of training Transformers with SGD. SGD-SaI excels in ImageNet-1K classification with Vision Transformers (ViT) and in GPT-2 pretraining for large language models (LLMs, transformer decoder-only), demonstrating robustness to hyperparameter variations and practicality for diverse applications. We further tested its robustness on tasks such as LoRA fine-tuning for LLMs and diffusion models, where it consistently outperforms state-of-the-art optimizers. From a memory efficiency perspective, SGD-SaI achieves substantial memory savings for optimizer states, reducing memory usage by 5.93 GB for GPT-2 (1.5B parameters) and 25.15 GB for Llama2-7B compared to AdamW in full-precision training settings.
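To make the idea of scaling learning rates once at initialization concrete, the following is a minimal sketch in plain Python. The abstract does not give the g-SNR formula, so the sketch assumes it is the L2 norm of a parameter group's first-step gradient divided by that gradient's standard deviation. The function names, the `EPS` constant, the normalisation by the largest g-SNR, and the toy gradients are all hypothetical choices made for illustration, not the authors' implementation.

```python
import math
import statistics

EPS = 1e-8  # assumed stabiliser to avoid division by zero; not from the abstract


def g_snr(grad):
    """Assumed g-SNR: L2 norm of a gradient block over its standard deviation."""
    norm = math.sqrt(sum(g * g for g in grad))
    std = statistics.pstdev(grad)
    return norm / (std + EPS)


def scale_learning_rates(base_lr, first_grads):
    """Derive one fixed learning rate per parameter group from first-step gradients.

    Dividing by the largest g-SNR keeps every rate at or below base_lr;
    this normalisation is an illustrative assumption.
    """
    snr = {name: g_snr(g) for name, g in first_grads.items()}
    top = max(snr.values())
    return {name: base_lr * value / top for name, value in snr.items()}


def sgdm_step(params, grads, velocity, lrs, momentum=0.9):
    """One SGD-with-momentum update using the frozen per-group learning rates.

    Only a single velocity buffer is kept per parameter; there is no
    second-moment buffer as in AdamW.
    """
    for name in params:
        velocity[name] = [momentum * v + g for v, g in zip(velocity[name], grads[name])]
        params[name] = [p - lrs[name] * v for p, v in zip(params[name], velocity[name])]


# Toy first-step gradients for two hypothetical parameter groups.
first_grads = {
    "attention": [0.5, 0.4, 0.6, 0.5],    # consistent sign and size: high g-SNR
    "embedding": [1.0, -1.0, 1.0, -1.0],  # zero mean, large spread: low g-SNR
}
lrs = scale_learning_rates(0.1, first_grads)  # computed once, then left unchanged

params = {name: [1.0] * 4 for name in first_grads}
velocity = {name: [0.0] * 4 for name in first_grads}
sgdm_step(params, first_grads, velocity, lrs)
```

In this toy setup the group with the noisier gradient receives the smaller learning rate, and the rates are never recomputed after the first step, which is what lets the optimizer keep only the momentum buffer.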
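The memory claim in the abstract can be sanity-checked with simple arithmetic. The sketch below assumes that AdamW keeps two full-precision (4-byte) state values per parameter, that SGD with momentum keeps one, and that the model sizes are the nominal round figures 1.5e9 and 7e9. These are assumptions for illustration, and the helper name `optimizer_state_saving_gib` is hypothetical.

```python
BYTES_PER_FP32 = 4   # full-precision optimizer state, as assumed here
GIB = 1024 ** 3


def optimizer_state_saving_gib(num_params, states_adamw=2, states_sgdm=1):
    """Optimizer-state memory saved, in GiB, when dropping state buffers."""
    saved_bytes = (states_adamw - states_sgdm) * num_params * BYTES_PER_FP32
    return saved_bytes / GIB


# Nominal parameter counts; the exact counts differ from these round numbers.
for name, n in [("GPT-2 (nominal 1.5e9)", 1.5e9), ("Llama2-7B (nominal 7e9)", 7e9)]:
    print(f"{name}: ~{optimizer_state_saving_gib(n):.2f} GiB saved")
```

With these round numbers the estimate comes to roughly 5.6 GiB and 26.1 GiB, in the same ballpark as the reported 5.93 GB and 25.15 GB. The exact values depend on the true parameter counts and the unit convention, neither of which the abstract spells out.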
