Optimizing Large Language Model Training Using FP4 Quantization
January 28, 2025
Authors: Ruizhe Wang, Yeyun Gong, Xiao Liu, Guoshuai Zhao, Ziyue Yang, Baining Guo, Zhengjun Zha, Peng Cheng
cs.AI
Abstract
The growing computational demands of training large language models (LLMs)
necessitate more efficient methods. Quantized training presents a promising
solution by enabling low-bit arithmetic operations to reduce these costs. While
FP8 precision has demonstrated feasibility, leveraging FP4 remains a challenge
due to significant quantization errors and limited representational capacity.
This work introduces the first FP4 training framework for LLMs, addressing
these challenges with two key innovations: a differentiable quantization
estimator for precise weight updates and an outlier clamping and compensation
strategy to prevent activation collapse. To ensure stability, the framework
integrates a mixed-precision training scheme and vector-wise quantization.
Experimental results demonstrate that our FP4 framework achieves accuracy
comparable to BF16 and FP8, with minimal degradation, scaling effectively to
13B-parameter LLMs trained on up to 100B tokens. With the emergence of
next-generation hardware supporting FP4, our framework sets a foundation for
efficient ultra-low precision training.
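To make the first key innovation named in the abstract more concrete, the sketch below shows one possible form a "differentiable quantization estimator" could take: a custom autograd function that rounds to a uniform grid in the forward pass but backpropagates through a smooth surrogate of the rounding function instead of the constant straight-through gradient. The tanh-based surrogate, the step size, and the sharpness parameter `k` are assumptions made for illustration; this is not the authors' implementation.

```python
# Minimal sketch (assumed, not the paper's code): a differentiable
# quantization estimator with a smooth surrogate backward pass.
import torch


class DifferentiableQuantEstimator(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, step: float = 0.5, k: float = 5.0):
        ctx.save_for_backward(w)
        ctx.step, ctx.k = step, k
        # Hard rounding to a uniform grid in the forward pass.
        return torch.round(w / step) * step

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        step, k = ctx.step, ctx.k
        # Derivative of a smooth approximation of round(): steep tanh ramps
        # around each grid midpoint, instead of the constant-1 STE gradient.
        frac = w / step - torch.floor(w / step) - 0.5
        surrogate_grad = 0.5 * k * (1.0 - torch.tanh(k * frac) ** 2)
        return grad_out * surrogate_grad, None, None


if __name__ == "__main__":
    w = torch.randn(8, requires_grad=True)
    y = DifferentiableQuantEstimator.apply(w)
    y.sum().backward()
    print(w.grad)  # input-dependent gradients rather than all-ones
```

The design point being illustrated is that the backward pass reflects where a weight sits relative to the nearest quantization boundary, which is what allows more precise weight updates than a plain straight-through estimator.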
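Similarly, the following sketch illustrates how vector-wise (per-row) quantization to an FP4-like grid could be combined with outlier clamping and a high-precision compensation term, in the spirit of the second innovation. The E2M1 value grid, the quantile threshold, and all function names are assumptions for this example only.

```python
# Minimal sketch (assumed, not the paper's code): vector-wise FP4 (E2M1)
# quantization with quantile-based outlier clamping and sparse compensation.
import torch

# Magnitudes representable by an E2M1 FP4 format (plus sign).
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_MAX = 6.0


def quantize_fp4_vectorwise(x: torch.Tensor, clamp_quantile: float = 0.99):
    """Quantize each row of `x` to FP4, returning the dequantized tensor
    plus a sparse high-precision residual holding the clamped outliers."""
    # 1. Outlier clamping: cap each row at its `clamp_quantile` magnitude.
    thresh = torch.quantile(x.abs(), clamp_quantile, dim=-1, keepdim=True)
    x_clamped = torch.clamp(x, -thresh, thresh)

    # 2. Compensation: keep the clipped-away part in higher precision so it
    #    can be added back after the low-precision matmul.
    residual = (x - x_clamped).to_sparse()

    # 3. Vector-wise scaling: map each row's max magnitude onto FP4_MAX.
    scale = x_clamped.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / FP4_MAX

    # 4. Round each scaled value to the nearest entry of the FP4 grid.
    scaled = (x_clamped / scale).unsqueeze(-1)            # [..., N, 1]
    idx = (scaled.abs() - FP4_GRID).abs().argmin(dim=-1)  # nearest magnitude
    q = FP4_GRID[idx] * scaled.squeeze(-1).sign()

    return q * scale, residual


if __name__ == "__main__":
    w = torch.randn(4, 16)
    w_q, res = quantize_fp4_vectorwise(w)
    # Remaining error is only the FP4 rounding error on the inlier values;
    # the clamped outliers are fully recovered by the compensation term.
    print((w - (w_q + res.to_dense())).abs().max())
```

Keeping the clipped residual in higher precision is what prevents the extreme activations from being destroyed by the narrow FP4 range, which is the failure mode ("activation collapse") the abstract refers to.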