
Optimizing Large Language Model Training Using FP4 Quantization

January 28, 2025
Authors: Ruizhe Wang, Yeyun Gong, Xiao Liu, Guoshuai Zhao, Ziyue Yang, Baining Guo, Zhengjun Zha, Peng Cheng
cs.AI

Abstract

The growing computational demands of training large language models (LLMs) necessitate more efficient methods. Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce these costs. While FP8 precision has demonstrated feasibility, leveraging FP4 remains a challenge due to significant quantization errors and limited representational capacity. This work introduces the first FP4 training framework for LLMs, addressing these challenges with two key innovations: a differentiable quantization estimator for precise weight updates and an outlier clamping and compensation strategy to prevent activation collapse. To ensure stability, the framework integrates a mixed-precision training scheme and vector-wise quantization. Experimental results demonstrate that our FP4 framework achieves accuracy comparable to BF16 and FP8, with minimal degradation, scaling effectively to 13B-parameter LLMs trained on up to 100B tokens. With the emergence of next-generation hardware supporting FP4, our framework sets a foundation for efficient ultra-low precision training.
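To make the named components more concrete, the sketch below illustrates vector-wise quantization onto an FP4-like value grid combined with outlier clamping and a residual compensation term. This is a minimal illustrative approximation, not the authors' implementation: the E2M1-style grid values, the clamping quantile, and the function names are assumptions introduced for this example, and the differentiable quantization estimator (which governs how gradients pass through the rounding step) is omitted.

```python
# Minimal sketch of vector-wise quantization with outlier clamping and
# compensation, in PyTorch. Illustrative only; the FP4 format, clamping
# quantile, and compensation scheme here are assumptions, not the paper's.
import torch

# Hypothetical E2M1-style FP4 value grid (assumed; the paper's exact format may differ).
FP4_GRID = torch.tensor([-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0,
                          0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_vectorwise(x: torch.Tensor, clamp_q: float = 0.99):
    """Quantize each row of a 2-D tensor to an FP4-like grid.

    1. Clamp per-row outliers at the `clamp_q` quantile and keep the clipped
       residual as a high-precision 'compensation' term.
    2. Scale each row so its max magnitude maps onto the grid's max value.
    3. Round every element to the nearest grid point.
    """
    thresh = torch.quantile(x.abs(), clamp_q, dim=-1, keepdim=True)
    clamped = x.clamp(-thresh, thresh)
    residual = x - clamped                      # sparse compensation term

    scale = clamped.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / FP4_GRID.max()
    scaled = clamped / scale
    # Nearest-neighbor rounding onto the FP4 grid.
    idx = (scaled.unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)
    dequant = FP4_GRID[idx] * scale
    return dequant, residual

# Usage: quantize activations row-by-row and check error after compensation.
x = torch.randn(4, 1024)
x_q, comp = quantize_fp4_vectorwise(x)
print((x - (x_q + comp)).abs().mean())
```

The per-row scaling is what "vector-wise quantization" refers to in this sketch: each row gets its own scale factor, which limits how far a single large activation can distort the quantization of the rest of the vector.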
