SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training

Abstract

Large Language Models (LLMs) have demonstrated exceptional performance across diverse tasks, yet their training remains highly resource-intensive and susceptible to critical challenges such as training instability. A predominant source of this instability is gradient and loss spikes, which disrupt the learning process and often lead to costly interventions such as checkpoint recovery and experiment restarts, further amplifying inefficiencies. This paper presents a comprehensive investigation into gradient spikes observed during LLM training, revealing their prevalence across multiple architectures and datasets. Our analysis shows that these spikes can be up to 1000 times larger than typical gradients, substantially deteriorating model performance. To address this issue, we propose Spike-Aware Adam with Momentum Reset (SPAM), a novel optimizer designed to counteract gradient spikes through momentum reset and spike-aware gradient clipping. Extensive experiments, including both pre-training and fine-tuning, demonstrate that SPAM consistently surpasses Adam and its variants across various tasks, including (1) LLM pre-training from 60M to 1B parameters, (2) 4-bit LLM pre-training, (3) reinforcement learning, and (4) time series forecasting. Additionally, SPAM facilitates memory-efficient training by enabling sparse momentum, where only a subset of momentum terms is maintained and updated. When operating under memory constraints, SPAM outperforms state-of-the-art memory-efficient optimizers such as GaLore and Adam-Mini. Our work underscores the importance of mitigating gradient spikes in LLM training and introduces an effective optimization strategy that enhances both training stability and resource efficiency at scale. Code is available at https://github.com/TianjinYellow/SPAM-Optimizer.git
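
The abstract names two mechanisms, momentum reset and spike-aware gradient clipping, without giving their exact form. The following NumPy sketch shows one way they could fit into an Adam-style step. It is an illustration under stated assumptions, not the authors' implementation: the function name `spam_like_step`, the parameters `spike_threshold` and `reset_interval`, their default values, and the spike test (squared gradient compared against a bias-corrected running second moment) are all hypothetical choices made for this example.

```python
import numpy as np

def spam_like_step(param, grad, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, spike_threshold=50.0, reset_interval=500):
    """One Adam-style update with spike clipping and periodic momentum reset.

    `step` counts from 1. All hyperparameter names and defaults are
    illustrative assumptions, not values taken from the paper.
    """
    # Position inside the current reset window (1 .. reset_interval).
    t_local = (step - 1) % reset_interval + 1

    # Momentum reset: at the start of each window, discard both moment
    # estimates so an earlier spike cannot keep steering later updates.
    if t_local == 1:
        m = np.zeros_like(m)
        v = np.zeros_like(v)

    # Spike-aware clipping (assumed rule): an entry is a spike when its
    # square exceeds spike_threshold times the running second moment.
    # Spiking entries are scaled down to that bound, keeping their sign.
    if t_local > 1:
        v_ref = v / (1.0 - beta2 ** (t_local - 1))  # bias-corrected estimate
        spike = grad ** 2 > spike_threshold * v_ref
        grad = np.where(spike, np.sign(grad) * np.sqrt(spike_threshold * v_ref), grad)
    else:
        # No history is available right after a reset, so nothing is clipped.
        spike = np.zeros(grad.shape, dtype=bool)

    # Standard Adam moment updates and bias correction within the window.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t_local)
    v_hat = v / (1.0 - beta2 ** t_local)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v, int(spike.sum())

# Usage: the second gradient contains one entry 1000 times larger than the rest.
p, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
p, m, v, n1 = spam_like_step(p, np.array([1.0, 1.0, 1.0]), m, v, step=1)
p, m, v, n2 = spam_like_step(p, np.array([1.0, 1.0, 1000.0]), m, v, step=2)
```

In this sketch the first step after a reset is left unclipped because no gradient history exists yet. A practical optimizer would likely need extra care there, for example a short learning-rate warmup after each reset.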
