SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training
January 12, 2025
Authors: Tianjin Huang, Ziquan Zhu, Gaojie Jin, Lu Liu, Zhangyang Wang, Shiwei Liu
cs.AI
Abstract
Large Language Models (LLMs) have demonstrated exceptional performance across
diverse tasks, yet their training remains highly resource-intensive and
susceptible to critical challenges such as training instability. A predominant
source of this instability stems from gradient and loss spikes, which disrupt
the learning process, often leading to costly interventions like checkpoint
recovery and experiment restarts, further amplifying inefficiencies. This paper
presents a comprehensive investigation into gradient spikes observed during LLM
training, revealing their prevalence across multiple architectures and
datasets. Our analysis shows that these spikes can be up to 1000 times larger
than typical gradients, substantially deteriorating model performance. To
address this issue, we propose Spike-Aware Adam with Momentum Reset (SPAM), a
novel optimizer designed to counteract gradient spikes through momentum reset
and spike-aware gradient clipping. Extensive experiments, including both
pre-training and fine-tuning, demonstrate that SPAM consistently surpasses Adam
and its variants across various tasks, including (1) LLM pre-training from 60M
to 1B parameters, (2) 4-bit LLM pre-training, (3) reinforcement learning, and
(4) time series forecasting. Additionally, SPAM facilitates memory-efficient training by
enabling sparse momentum, where only a subset of momentum terms are maintained
and updated. When operating under memory constraints, SPAM outperforms
state-of-the-art memory-efficient optimizers such as GaLore and Adam-Mini. Our
work underscores the importance of mitigating gradient spikes in LLM training
and introduces an effective optimization strategy that enhances both training
stability and resource efficiency at scale. Code is available at
https://github.com/TianjinYellow/SPAM-Optimizer.git
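
To make the two ideas named in the abstract concrete, here is a minimal, illustrative sketch of an Adam-style optimizer with periodic momentum reset and spike-aware gradient clipping. It is written only from the abstract's high-level description; the class name SPAMSketch and the hyperparameters reset_interval and spike_threshold are assumptions for illustration, not the authors' implementation (see the repository above for the real one), and sparse momentum is omitted.

```python
# Illustrative sketch (NOT the official SPAM code): Adam with
# (a) periodic momentum reset and (b) spike-aware gradient clipping.
import torch
from torch.optim import Optimizer


class SPAMSketch(Optimizer):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 reset_interval=500, spike_threshold=50.0):
        defaults = dict(lr=lr, betas=betas, eps=eps,
                        reset_interval=reset_interval,
                        spike_threshold=spike_threshold)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = closure() if closure is not None else None
        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                grad = p.grad
                state = self.state[p]
                if len(state) == 0:
                    state["step"] = 0
                    state["exp_avg"] = torch.zeros_like(p)
                    state["exp_avg_sq"] = torch.zeros_like(p)
                state["step"] += 1

                # (a) Momentum reset: periodically zero both moment estimates
                # so a past gradient spike cannot keep contaminating them.
                if state["step"] % group["reset_interval"] == 0:
                    state["exp_avg"].zero_()
                    state["exp_avg_sq"].zero_()

                # (b) Spike-aware clipping: where the squared gradient greatly
                # exceeds the running second moment, rescale that entry.
                v = state["exp_avg_sq"]
                spike_mask = (v > 0) & (grad.pow(2) > group["spike_threshold"] * v)
                if spike_mask.any():
                    scale = (group["spike_threshold"] * v).sqrt()
                    grad = torch.where(spike_mask, grad.sign() * scale, grad)

                # Standard Adam update with bias correction.
                exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]
                exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
                bias1 = 1 - beta1 ** state["step"]
                bias2 = 1 - beta2 ** state["step"]
                denom = (exp_avg_sq / bias2).sqrt().add_(group["eps"])
                p.addcdiv_(exp_avg / bias1, denom, value=-group["lr"])
        return loss


# Usage (toy): opt = SPAMSketch(model.parameters(), lr=1e-3)
```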