APOLLO: SGD-like Memory, AdamW-level Performance

December 6, 2024
Authors: Hanqing Zhu, Zhenyu Zhang, Wenyan Cong, Xi Liu, Sem Park, Vikas Chandra, Bo Long, David Z. Pan, Zhangyang Wang, Jinwon Lee
cs.AI

Abstract

Large language models (LLMs) are notoriously memory-intensive during training, particularly with the popular AdamW optimizer. This memory burden necessitates using more or higher-end GPUs or reducing batch sizes, limiting training scalability and throughput. To address this, various memory-efficient optimizers have been proposed to reduce optimizer memory usage. However, they face critical challenges: (i) reliance on costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial optimizer memory overhead to maintain competitive performance. In this work, we identify that AdamW's learning rate adaptation rule can be effectively coarsened as a structured learning rate update. Based on this insight, we propose Approximated Gradient Scaling for Memory-Efficient LLM Optimization (APOLLO), which approximates learning rate scaling using an auxiliary low-rank optimizer state based on pure random projection. This structured learning rate update rule makes APOLLO highly tolerant to further memory reductions while delivering comparable pre-training performance. Even its rank-1 variant, APOLLO-Mini, achieves superior pre-training performance compared to AdamW with SGD-level memory costs. Extensive experiments demonstrate that the APOLLO series performs on-par with or better than AdamW, while achieving greater memory savings by nearly eliminating the optimization states of AdamW. These savings provide significant system-level benefits: (1) Enhanced Throughput: 3x throughput on an 8xA100-80GB setup compared to AdamW by supporting 4x larger batch sizes. (2) Improved Model Scalability: Pre-training LLaMA-13B with naive DDP on A100-80GB GPUs without system-level optimizations. (3) Low-End GPU Friendly Pre-training: Pre-training LLaMA-7B on a single GPU using less than 12 GB of memory with weight quantization.
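
The abstract describes APOLLO's core idea: replacing AdamW's element-wise learning-rate adaptation with a structured (e.g., channel-wise) scaling factor derived from an auxiliary low-rank optimizer state built by pure random projection. The snippet below is a minimal, hypothetical PyTorch sketch of that idea, not the authors' implementation: the class name `ApolloLikeOptimizer`, the per-column (channel-wise) scaling, and all hyperparameter defaults are assumptions for illustration.

```python
# Hypothetical sketch of the APOLLO idea (not the authors' code): keep AdamW-style
# moments only in a low-rank space obtained by a fixed random projection, derive a
# per-channel learning-rate scale from them, and apply that scale to the raw gradient.
import torch

class ApolloLikeOptimizer:
    def __init__(self, params, lr=1e-3, rank=8, betas=(0.9, 0.999), eps=1e-8):
        self.params = list(params)
        self.lr, self.rank, self.betas, self.eps = lr, rank, betas, eps
        self.state = {}
        self.step_count = 0

    @torch.no_grad()
    def step(self):
        self.step_count += 1
        b1, b2 = self.betas
        for p in self.params:
            if p.grad is None:
                continue
            g = p.grad
            if g.ndim < 2:                      # fall back to plain SGD for 1-D params
                p.add_(g, alpha=-self.lr)
                continue
            st = self.state.setdefault(p, {
                # fixed random projection (rank x n_rows), JL-style scaling (assumed)
                "proj": torch.randn(self.rank, g.shape[0], device=g.device) / self.rank ** 0.5,
                "m": torch.zeros(self.rank, g.shape[1], device=g.device),
                "v": torch.zeros(self.rank, g.shape[1], device=g.device),
            })
            r = st["proj"] @ g                  # low-rank view of the gradient
            st["m"].mul_(b1).add_(r, alpha=1 - b1)
            st["v"].mul_(b2).addcmul_(r, r, value=1 - b2)
            m_hat = st["m"] / (1 - b1 ** self.step_count)
            v_hat = st["v"] / (1 - b2 ** self.step_count)
            r_adapted = m_hat / (v_hat.sqrt() + self.eps)
            # channel-wise scale: how much an AdamW-like rule would stretch each column
            scale = r_adapted.norm(dim=0) / (r.norm(dim=0) + self.eps)
            p.add_(g * scale, alpha=-self.lr)   # SGD-like update with structured scaling
```

Usage mirrors a standard optimizer loop: construct `ApolloLikeOptimizer(model.parameters(), rank=8)`, call `loss.backward()`, then `step()`. Under this sketch, APOLLO-Mini's rank-1, tensor-wise variant would correspond to `rank=1` with the per-channel scale collapsed to a single scalar; the exact update rule and projection details are specified in the paper itself.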
