APOLLO: SGD-like Memory, AdamW-level Performance

Abstract

Large language models (LLMs) are notoriously memory-intensive during training, particularly with the popular AdamW optimizer. This memory burden forces practitioners to use more or higher-end GPUs or to reduce batch sizes, which limits training scalability and throughput. To address this, various memory-efficient optimizers have been proposed to reduce optimizer memory usage. However, they face critical challenges: (i) reliance on costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) optimizer memory overhead that remains substantial if competitive performance is to be maintained. In this work, we identify that AdamW's learning rate adaptation rule can be effectively coarsened into a structured learning rate update. Based on this insight, we propose Approximated Gradient Scaling for Memory-Efficient LLM Optimization (APOLLO), which approximates learning rate scaling using an auxiliary low-rank optimizer state based on pure random projection. This structured learning rate update rule makes APOLLO highly tolerant to further memory reductions while delivering comparable pre-training performance. Even its rank-1 variant, APOLLO-Mini, achieves better pre-training performance than AdamW at SGD-level memory cost. Extensive experiments demonstrate that the APOLLO series performs on par with or better than AdamW, while achieving greater memory savings by nearly eliminating AdamW's optimizer states. These savings provide significant system-level benefits: (1) Enhanced Throughput: 3x throughput on an 8xA100-80GB setup compared to AdamW by supporting 4x larger batch sizes. (2) Improved Model Scalability: Pre-training LLaMA-13B with naive DDP on A100-80GB GPUs without system-level optimizations. (3) Low-End GPU Friendly Pre-training: Pre-training LLaMA-7B on a single GPU using less than 12 GB of memory with weight quantization.
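The abstract does not spell out the update rule, so the following is only a minimal sketch of one way to read "approximating learning rate scaling with a low-rank auxiliary state based on random projection". Everything specific here is an assumption made for illustration and is not taken from the source: the class name `LowRankScaler`, the Gaussian projection, the choice of one scaling factor per gradient column, and the hyperparameter defaults. The idea shown is that Adam-style moments are kept only for a projected `rank x cols` gradient, and the resulting per-column rescaling is then applied to the full-rank gradient.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of a (p x q) and b (q x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def column_norms(mat):
    """Euclidean norm of each column of a matrix given as a list of rows."""
    return [math.sqrt(sum(v * v for v in col)) for col in zip(*mat)]


class LowRankScaler:
    """Hypothetical helper: Adam-style moments kept only for a projected gradient."""

    def __init__(self, rows, cols, rank, beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
        rng = random.Random(seed)
        std = 1.0 / math.sqrt(rank)
        # Fixed Gaussian random projection (rank x rows); no SVD is involved.
        self.proj = [[rng.gauss(0.0, std) for _ in range(rows)] for _ in range(rank)]
        # First and second moments live in the projected (rank x cols) space.
        self.m = [[0.0] * cols for _ in range(rank)]
        self.v = [[0.0] * cols for _ in range(rank)]
        self.beta1, self.beta2, self.eps = beta1, beta2, eps
        self.step = 0

    def state_size(self):
        """Number of floats held in the two moment buffers."""
        return 2 * len(self.m) * len(self.m[0])

    def scale_gradient(self, grad):
        """Return (scaled full-rank gradient, per-column scaling factors)."""
        self.step += 1
        low = matmul(self.proj, grad)  # projected gradient, rank x cols
        for i, row in enumerate(low):
            for j, g in enumerate(row):
                self.m[i][j] = self.beta1 * self.m[i][j] + (1 - self.beta1) * g
                self.v[i][j] = self.beta2 * self.v[i][j] + (1 - self.beta2) * g * g
        c1 = 1 - self.beta1 ** self.step  # bias corrections, as in Adam
        c2 = 1 - self.beta2 ** self.step
        adapted = [
            [(m / c1) / (math.sqrt(v / c2) + self.eps) for m, v in zip(m_row, v_row)]
            for m_row, v_row in zip(self.m, self.v)
        ]
        # One factor per column: how strongly the Adam rule rescaled that
        # column in the low-rank space (assumed granularity, see lead-in).
        scale = [
            a / (r + self.eps)
            for a, r in zip(column_norms(adapted), column_norms(low))
        ]
        scaled = [[g * s for g, s in zip(row, scale)] for row in grad]
        return scaled, scale


rows, cols = 8, 4
grad = [[float(i + j + 1) for j in range(cols)] for i in range(rows)]
scaler = LowRankScaler(rows, cols, rank=1)
scaled, scale = scaler.scale_gradient(grad)
print("scaling factors:", scale)
print("state floats:", scaler.state_size(), "vs full AdamW:", 2 * rows * cols)
```

In a training loop, a weight update under this sketch would take the form `W -= lr * scaled`. The point of the design is that the moment buffers grow with `rank * cols` rather than `rows * cols`, so with `rank=1` the optimizer state is a small fraction of the two full-size buffers AdamW keeps per weight matrix.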

