APOLLO: SGD-like Memory, AdamW-level Performance
December 6, 2024
Authors: Hanqing Zhu, Zhenyu Zhang, Wenyan Cong, Xi Liu, Sem Park, Vikas Chandra, Bo Long, David Z. Pan, Zhangyang Wang, Jinwon Lee
cs.AI
Abstract
Large language models (LLMs) are notoriously memory-intensive during training, particularly with the popular AdamW optimizer. This memory burden necessitates using more or higher-end GPUs or reducing batch sizes, limiting training scalability and throughput. To address this, various memory-efficient optimizers have been proposed to reduce optimizer memory usage. However, they face critical challenges: (i) reliance on costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial optimizer memory overhead to maintain competitive performance.

In this work, we identify that AdamW's learning rate adaptation rule can be effectively coarsened as a structured learning rate update. Based on this insight, we propose Approximated Gradient Scaling for Memory-Efficient LLM Optimization (APOLLO), which approximates learning rate scaling using an auxiliary low-rank optimizer state based on pure random projection. This structured learning rate update rule makes APOLLO highly tolerant to further memory reductions while delivering comparable pre-training performance. Even its rank-1 variant, APOLLO-Mini, achieves superior pre-training performance compared to AdamW with SGD-level memory costs.
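The abstract describes the mechanism only at a high level, so the following is a minimal, hypothetical PyTorch-style sketch of the idea as stated here: keep AdamW-style moments only for a randomly projected low-rank copy of the gradient, turn them into a structured (channel-wise) learning-rate scaling, and apply that scaling to the raw gradient in an SGD-like step. The function name `apollo_like_step`, the per-row norm-ratio scaling, and the exact state layout are illustrative assumptions, not the authors' released implementation.

```python
import torch

def apollo_like_step(param, grad, state, lr=1e-3, rank=8,
                     betas=(0.9, 0.999), eps=1e-8):
    """One hypothetical update step for a 2-D weight of shape (m, n)."""
    m_dim, n_dim = grad.shape
    beta1, beta2 = betas

    if "proj" not in state:
        # Pure random projection (no SVD): a fixed Gaussian matrix of shape (n, rank).
        state["proj"] = torch.randn(n_dim, rank, device=grad.device) / rank ** 0.5
        state["exp_avg"] = torch.zeros(m_dim, rank, device=grad.device)
        state["exp_avg_sq"] = torch.zeros(m_dim, rank, device=grad.device)
        state["step"] = 0
    state["step"] += 1

    # Project the gradient into the low-rank space: (m, n) -> (m, rank).
    g_low = grad @ state["proj"]

    # AdamW-style first/second moments, kept only in the low-rank space.
    state["exp_avg"].mul_(beta1).add_(g_low, alpha=1 - beta1)
    state["exp_avg_sq"].mul_(beta2).addcmul_(g_low, g_low, value=1 - beta2)
    bc1 = 1 - beta1 ** state["step"]
    bc2 = 1 - beta2 ** state["step"]
    update_low = (state["exp_avg"] / bc1) / ((state["exp_avg_sq"] / bc2).sqrt() + eps)

    # Structured (per output channel) learning-rate scaling: how much an
    # Adam-style rule would rescale each row, estimated from the low-rank proxy.
    scale = update_low.norm(dim=1) / (g_low.norm(dim=1) + eps)  # shape (m,)

    # Apply the channel-wise scale to the full-rank gradient; SGD-like step,
    # no full-size optimizer states are ever materialized.
    param.add_(grad * scale.unsqueeze(1), alpha=-lr)

# Toy usage:
W = torch.randn(64, 128)
g = torch.randn(64, 128)
st = {}
apollo_like_step(W, g, st, lr=1e-2, rank=4)
```

In this sketch, the rank-1 case keeps on the order of 2m + n numbers of state per m-by-n weight instead of the 2mn numbers AdamW stores, which is how a variant like APOLLO-Mini can approach SGD-level memory cost.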
Extensive experiments demonstrate that the APOLLO series performs on par with or better than AdamW, while achieving greater memory savings by nearly eliminating the optimization states of AdamW. These savings provide significant system-level benefits: (1) Enhanced Throughput: 3x throughput on an 8xA100-80GB setup compared to AdamW by supporting 4x larger batch sizes. (2) Improved Model Scalability: Pre-training LLaMA-13B with naive DDP on A100-80GB GPUs without system-level optimizations. (3) Low-End GPU Friendly Pre-training: Pre-training LLaMA-7B on a single GPU using less than 12 GB of memory with weight quantization.
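To make the "nearly eliminating the optimization states" claim concrete, here is a rough back-of-envelope comparison under assumed conditions: fp32 optimizer states, a LLaMA-7B-like layer inventory (32 blocks, hidden size 4096, MLP width 11008, embeddings and norms ignored), and the rank-1 state layout from the sketch above. The helper names and shapes are illustrative, not taken from the paper.

```python
def adamw_state_bytes(num_params, bytes_per_el=4):
    # AdamW keeps two full-size moment tensors (m and v) per parameter.
    return 2 * num_params * bytes_per_el

def lowrank_state_bytes(shapes, rank=1, bytes_per_el=4):
    # Per 2-D weight (m, n): two low-rank moments of shape (m, rank)
    # plus one random projection of shape (n, rank), as in the sketch above.
    return sum((2 * m + n) * rank * bytes_per_el for m, n in shapes)

# Assumed LLaMA-7B-like transformer-layer inventory (embeddings omitted).
shapes = []
for _ in range(32):
    shapes += [(4096, 4096)] * 4                              # q, k, v, o projections
    shapes += [(11008, 4096), (11008, 4096), (4096, 11008)]   # gate, up, down MLP

print(f"AdamW states (7B params, fp32): ~{adamw_state_bytes(7e9) / 1e9:.0f} GB")
print(f"Rank-1 low-rank states (fp32):  ~{lowrank_state_bytes(shapes) / 1e6:.0f} MB")
# -> roughly 56 GB vs. about 15 MB under these assumptions, so the remaining
#    footprint is dominated by weights, gradients, and activations, not the optimizer.
```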