

Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam

February 24, 2025
Authors: Tianjin Huang, Haotian Hu, Zhenyu Zhang, Gaojie Jin, Xiang Li, Li Shen, Tianlong Chen, Lu Liu, Qingsong Wen, Zhangyang Wang, Shiwei Liu
cs.AI

Abstract

This paper comprehensively evaluates several recently proposed optimizers for 4-bit training, revealing that low-bit precision amplifies sensitivity to learning rates and often causes unstable gradient norms, leading to divergence at higher learning rates. Among these, SPAM, a recent optimizer featuring momentum reset and spike-aware gradient clipping, achieves the best performance across various bit levels, but struggles to stabilize gradient norms and requires careful learning-rate tuning. To address these limitations, we propose Stable-SPAM, which incorporates enhanced gradient normalization and clipping techniques. In particular, Stable-SPAM (1) adaptively updates the clipping threshold for spiked gradients by tracking their historical maxima; (2) normalizes the entire gradient matrix based on its historical l_2-norm statistics; and (3) inherits momentum reset from SPAM to periodically reset the first and second moments of Adam, mitigating the accumulation of spiked gradients. Extensive experiments show that Stable-SPAM effectively stabilizes gradient norms in 4-bit LLM training, delivering superior performance compared to Adam and SPAM. Notably, our 4-bit LLaMA-1B model trained with Stable-SPAM outperforms the BF16 LLaMA-1B trained with Adam by up to 2 perplexity points. Furthermore, when both models are trained in 4-bit, Stable-SPAM achieves the same loss as Adam while requiring only about half the training steps. Code is available at https://github.com/TianjinYellow/StableSPAM.git.
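The three stabilization steps named in the abstract can be pictured as a per-parameter gradient transform applied before the usual Adam update. The PyTorch sketch below is only an illustration under assumed hyperparameter names (`theta`, `gamma`, `reset_interval`) and simplified update rules; it is not the authors' released implementation, which is available at the linked repository.

```python
import torch


def stable_spam_transform(grad, state, step, theta=0.999, gamma=0.7,
                          reset_interval=500, eps=1e-8):
    """Sketch of the stabilization steps described in the abstract.

    `state` is a plain dict holding running statistics and Adam moments.
    Hyperparameter names and exact recurrences are illustrative assumptions.
    """
    # (1) Adaptive spike clipping: track a running estimate of the historical
    #     maximum gradient magnitude and clip entries that exceed it.
    g_max = grad.abs().max()
    state["max_ema"] = theta * state.get("max_ema", 0.0) + (1 - theta) * g_max
    spike_mask = grad.abs() > state["max_ema"]
    grad = torch.where(spike_mask, grad.sign() * state["max_ema"], grad)

    # (2) Adaptive gradient norm: rescale the whole gradient matrix using a
    #     running estimate of its historical l2-norm.
    g_norm = grad.norm()
    state["norm_ema"] = gamma * state.get("norm_ema", 0.0) + (1 - gamma) * g_norm
    grad = grad * (state["norm_ema"] / (g_norm + eps))

    # (3) Momentum reset (inherited from SPAM): periodically clear Adam's
    #     first and second moments so spiked gradients do not accumulate.
    if step % reset_interval == 0:
        state["exp_avg"] = torch.zeros_like(grad)
        state["exp_avg_sq"] = torch.zeros_like(grad)

    return grad, state
```

The transformed gradient would then feed a standard Adam update; steps (1) and (2) bound both entry-wise spikes and the overall gradient norm, which is why the method tolerates the higher learning-rate sensitivity induced by 4-bit quantization.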

