APOLLO: SGD-like Memory, AdamW-level Performance
December 6, 2024
Authors: Hanqing Zhu, Zhenyu Zhang, Wenyan Cong, Xi Liu, Sem Park, Vikas Chandra, Bo Long, David Z. Pan, Zhangyang Wang, Jinwon Lee
cs.AI
Abstract
Large language models (LLMs) are notoriously memory-intensive during training, particularly with the popular AdamW optimizer. This memory burden necessitates using more or higher-end GPUs or reducing batch sizes, limiting training scalability and throughput. To address this, various memory-efficient optimizers have been proposed to reduce optimizer memory usage. However, they face critical challenges: (i) reliance on costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial optimizer memory overhead to maintain competitive performance.

In this work, we identify that AdamW's learning rate adaptation rule can be effectively coarsened as a structured learning rate update. Based on this insight, we propose Approximated Gradient Scaling for Memory-Efficient LLM Optimization (APOLLO), which approximates learning rate scaling using an auxiliary low-rank optimizer state based on pure random projection. This structured learning rate update rule makes APOLLO highly tolerant to further memory reductions while delivering comparable pre-training performance. Even its rank-1 variant, APOLLO-Mini, achieves superior pre-training performance compared to AdamW with SGD-level memory costs.
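The abstract describes the mechanism only at a high level, so the following is a minimal, hypothetical PyTorch-style sketch of the idea as stated here: keep AdamW-style moments only for a randomly projected low-rank copy of the gradient, turn them into a structured (channel-wise) learning-rate scaling, and apply that scaling to the raw gradient in an SGD-like step. The function name `apollo_like_step`, the per-row norm-ratio scaling, and the exact state layout are illustrative assumptions, not the authors' released implementation.

```python
import torch

def apollo_like_step(param, grad, state, lr=1e-3, rank=8,
                     betas=(0.9, 0.999), eps=1e-8):
    """One hypothetical update step for a 2-D weight of shape (m, n)."""
    m_dim, n_dim = grad.shape
    beta1, beta2 = betas

    if "proj" not in state:
        # Pure random projection (no SVD): a fixed Gaussian matrix of shape (n, rank).
        state["proj"] = torch.randn(n_dim, rank, device=grad.device) / rank ** 0.5
        state["exp_avg"] = torch.zeros(m_dim, rank, device=grad.device)
        state["exp_avg_sq"] = torch.zeros(m_dim, rank, device=grad.device)
        state["step"] = 0
    state["step"] += 1

    # Project the gradient into the low-rank space: (m, n) -> (m, rank).
    g_low = grad @ state["proj"]

    # AdamW-style first/second moments, kept only in the low-rank space.
    state["exp_avg"].mul_(beta1).add_(g_low, alpha=1 - beta1)
    state["exp_avg_sq"].mul_(beta2).addcmul_(g_low, g_low, value=1 - beta2)
    bc1 = 1 - beta1 ** state["step"]
    bc2 = 1 - beta2 ** state["step"]
    update_low = (state["exp_avg"] / bc1) / ((state["exp_avg_sq"] / bc2).sqrt() + eps)

    # Structured (per output channel) learning-rate scaling: how much an
    # Adam-style rule would rescale each row, estimated from the low-rank proxy.
    scale = update_low.norm(dim=1) / (g_low.norm(dim=1) + eps)  # shape (m,)

    # Apply the channel-wise scale to the full-rank gradient; SGD-like step,
    # no full-size optimizer states are ever materialized.
    param.add_(grad * scale.unsqueeze(1), alpha=-lr)

# Toy usage:
W = torch.randn(64, 128)
g = torch.randn(64, 128)
st = {}
apollo_like_step(W, g, st, lr=1e-2, rank=4)
```

In this sketch, the rank-1 case keeps on the order of 2m + n numbers of state per m-by-n weight instead of the 2mn numbers AdamW stores, which is how a variant like APOLLO-Mini can approach SGD-level memory cost.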
Extensive experiments demonstrate that the APOLLO series performs on par with or better than AdamW, while achieving greater memory savings by nearly eliminating the optimization states of AdamW. These savings provide significant system-level benefits: (1) Enhanced Throughput: 3x throughput on an 8xA100-80GB setup compared to AdamW by supporting 4x larger batch sizes. (2) Improved Model Scalability: Pre-training LLaMA-13B with naive DDP on A100-80GB GPUs without system-level optimizations. (3) Low-End GPU Friendly Pre-training: Pre-training LLaMA-7B on a single GPU using less than 12 GB of memory with weight quantization.
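To make the "nearly eliminating the optimization states" claim concrete, here is a rough back-of-envelope comparison under assumed conditions: fp32 optimizer states, a LLaMA-7B-like layer inventory (32 blocks, hidden size 4096, MLP width 11008, embeddings and norms ignored), and the rank-1 state layout from the sketch above. The helper names and shapes are illustrative, not taken from the paper.

```python
def adamw_state_bytes(num_params, bytes_per_el=4):
    # AdamW keeps two full-size moment tensors (m and v) per parameter.
    return 2 * num_params * bytes_per_el

def lowrank_state_bytes(shapes, rank=1, bytes_per_el=4):
    # Per 2-D weight (m, n): two low-rank moments of shape (m, rank)
    # plus one random projection of shape (n, rank), as in the sketch above.
    return sum((2 * m + n) * rank * bytes_per_el for m, n in shapes)

# Assumed LLaMA-7B-like transformer-layer inventory (embeddings omitted).
shapes = []
for _ in range(32):
    shapes += [(4096, 4096)] * 4                              # q, k, v, o projections
    shapes += [(11008, 4096), (11008, 4096), (4096, 11008)]   # gate, up, down MLP

print(f"AdamW states (7B params, fp32): ~{adamw_state_bytes(7e9) / 1e9:.0f} GB")
print(f"Rank-1 low-rank states (fp32):  ~{lowrank_state_bytes(shapes) / 1e6:.0f} MB")
# -> roughly 56 GB vs. about 15 MB under these assumptions, so the remaining
#    footprint is dominated by weights, gradients, and activations, not the optimizer.
```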