
SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training

January 12, 2025
Authors: Tianjin Huang, Ziquan Zhu, Gaojie Jin, Lu Liu, Zhangyang Wang, Shiwei Liu
cs.AI

Abstract

Large Language Models (LLMs) have demonstrated exceptional performance across diverse tasks, yet their training remains highly resource-intensive and susceptible to critical challenges such as training instability. A predominant source of this instability stems from gradient and loss spikes, which disrupt the learning process, often leading to costly interventions like checkpoint recovery and experiment restarts, further amplifying inefficiencies. This paper presents a comprehensive investigation into gradient spikes observed during LLM training, revealing their prevalence across multiple architectures and datasets. Our analysis shows that these spikes can be up to 1000× larger than typical gradients, substantially deteriorating model performance. To address this issue, we propose Spike-Aware Adam with Momentum Reset (SPAM), a novel optimizer designed to counteract gradient spikes through momentum reset and spike-aware gradient clipping. Extensive experiments, including both pre-training and fine-tuning, demonstrate that SPAM consistently surpasses Adam and its variants across various tasks, including (1) LLM pre-training from 60M to 1B parameters, (2) 4-bit LLM pre-training, (3) reinforcement learning, and (4) time series forecasting. Additionally, SPAM facilitates memory-efficient training by enabling sparse momentum, where only a subset of momentum terms are maintained and updated. When operating under memory constraints, SPAM outperforms state-of-the-art memory-efficient optimizers such as GaLore and Adam-Mini. Our work underscores the importance of mitigating gradient spikes in LLM training and introduces an effective optimization strategy that enhances both training stability and resource efficiency at scale. Code is available at https://github.com/TianjinYellow/SPAM-Optimizer.git.
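The abstract names three mechanisms: momentum reset, spike-aware gradient clipping, and sparse momentum. The sketch below illustrates only the first two layered on a plain Adam step, in NumPy. The class name `SpikeAwareAdamSketch`, the detection rule (flag entries whose squared gradient exceeds `spike_threshold` times the running second moment), and the `reset_interval` default are assumptions made for illustration, not the paper's exact formulation; see the repository linked above for the authors' implementation.

```python
import numpy as np

class SpikeAwareAdamSketch:
    """Illustrative sketch: Adam with periodic momentum reset and
    spike-aware gradient clipping (names and thresholds are assumptions)."""

    def __init__(self, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 spike_threshold=50.0, reset_interval=500):
        self.lr, self.eps = lr, eps
        self.b1, self.b2 = betas
        self.theta = spike_threshold          # assumed spike-detection factor
        self.reset_interval = reset_interval  # assumed reset period (steps)
        self.m = None                         # first moment
        self.v = None                         # second moment
        self.t = 0                            # step counter

    def step(self, params, grads):
        if self.m is None:
            self.m = np.zeros_like(params)
            self.v = np.zeros_like(params)

        # Momentum reset: periodically discard accumulated moments so a past
        # gradient spike cannot keep steering later updates.
        if self.t > 0 and self.t % self.reset_interval == 0:
            self.m.fill(0.0)
            self.v.fill(0.0)

        # Spike-aware clipping: entries whose squared gradient greatly exceeds
        # the running second-moment estimate are scaled back toward it.
        spikes = (self.v > 0) & (grads ** 2 > self.theta * self.v)
        grads = np.where(spikes,
                         np.sign(grads) * np.sqrt(self.theta * self.v),
                         grads)

        # Standard Adam update with bias correction.
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * grads
        self.v = self.b2 * self.v + (1 - self.b2) * grads ** 2
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        return params - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)


# Toy usage: a noisy quadratic with one injected gradient spike.
rng = np.random.default_rng(0)
w = rng.normal(size=4)
opt = SpikeAwareAdamSketch()
for step in range(200):
    grad = 2 * w + 0.01 * rng.normal(size=4)
    if step == 100:
        grad[0] += 1e3      # simulate a spike ~1000x the typical gradient
    w = opt.step(w, grad)
print(w)
```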
