APOLLO: SGD-like Memory, AdamW-level Performance

December 6, 2024
Authors: Hanqing Zhu, Zhenyu Zhang, Wenyan Cong, Xi Liu, Sem Park, Vikas Chandra, Bo Long, David Z. Pan, Zhangyang Wang, Jinwon Lee
cs.AI

Abstract

Large language models (LLMs) are notoriously memory-intensive during training, particularly with the popular AdamW optimizer. This memory burden necessitates using more or higher-end GPUs or reducing batch sizes, limiting training scalability and throughput. To address this, various memory-efficient optimizers have been proposed to reduce optimizer memory usage. However, they face critical challenges: (i) reliance on costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial optimizer memory overhead to maintain competitive performance. In this work, we identify that AdamW's learning rate adaptation rule can be effectively coarsened as a structured learning rate update. Based on this insight, we propose Approximated Gradient Scaling for Memory-Efficient LLM Optimization (APOLLO), which approximates learning rate scaling using an auxiliary low-rank optimizer state based on pure random projection. This structured learning rate update rule makes APOLLO highly tolerant to further memory reductions while delivering comparable pre-training performance. Even its rank-1 variant, APOLLO-Mini, achieves superior pre-training performance compared to AdamW with SGD-level memory costs. Extensive experiments demonstrate that the APOLLO series performs on-par with or better than AdamW, while achieving greater memory savings by nearly eliminating the optimization states of AdamW. These savings provide significant system-level benefits: (1) Enhanced Throughput: 3x throughput on an 8xA100-80GB setup compared to AdamW by supporting 4x larger batch sizes. (2) Improved Model Scalability: Pre-training LLaMA-13B with naive DDP on A100-80GB GPUs without system-level optimizations. (3) Low-End GPU Friendly Pre-training: Pre-training LLaMA-7B on a single GPU using less than 12 GB of memory with weight quantization.
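
The abstract describes APOLLO's core idea: replacing AdamW's element-wise learning-rate adaptation with a structured (e.g., channel-wise) scaling factor derived from an auxiliary low-rank optimizer state built by pure random projection. The snippet below is a minimal, hypothetical PyTorch sketch of that idea, not the authors' implementation: the class name `ApolloLikeOptimizer`, the per-column (channel-wise) scaling, and all hyperparameter defaults are assumptions for illustration.

```python
# Hypothetical sketch of the APOLLO idea (not the authors' code): keep AdamW-style
# moments only in a low-rank space obtained by a fixed random projection, derive a
# per-channel learning-rate scale from them, and apply that scale to the raw gradient.
import torch

class ApolloLikeOptimizer:
    def __init__(self, params, lr=1e-3, rank=8, betas=(0.9, 0.999), eps=1e-8):
        self.params = list(params)
        self.lr, self.rank, self.betas, self.eps = lr, rank, betas, eps
        self.state = {}
        self.step_count = 0

    @torch.no_grad()
    def step(self):
        self.step_count += 1
        b1, b2 = self.betas
        for p in self.params:
            if p.grad is None:
                continue
            g = p.grad
            if g.ndim < 2:                      # fall back to plain SGD for 1-D params
                p.add_(g, alpha=-self.lr)
                continue
            st = self.state.setdefault(p, {
                # fixed random projection (rank x n_rows), JL-style scaling (assumed)
                "proj": torch.randn(self.rank, g.shape[0], device=g.device) / self.rank ** 0.5,
                "m": torch.zeros(self.rank, g.shape[1], device=g.device),
                "v": torch.zeros(self.rank, g.shape[1], device=g.device),
            })
            r = st["proj"] @ g                  # low-rank view of the gradient
            st["m"].mul_(b1).add_(r, alpha=1 - b1)
            st["v"].mul_(b2).addcmul_(r, r, value=1 - b2)
            m_hat = st["m"] / (1 - b1 ** self.step_count)
            v_hat = st["v"] / (1 - b2 ** self.step_count)
            r_adapted = m_hat / (v_hat.sqrt() + self.eps)
            # channel-wise scale: how much an AdamW-like rule would stretch each column
            scale = r_adapted.norm(dim=0) / (r.norm(dim=0) + self.eps)
            p.add_(g * scale, alpha=-self.lr)   # SGD-like update with structured scaling
```

Usage mirrors a standard optimizer loop: construct `ApolloLikeOptimizer(model.parameters(), rank=8)`, call `loss.backward()`, then `step()`. Under this sketch, APOLLO-Mini's rank-1, tensor-wise variant would correspond to `rank=1` with the per-channel scale collapsed to a single scalar; the exact update rule and projection details are specified in the paper itself.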
