COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training

October 25, 2024
Authors: Haocheng Xi, Han Cai, Ligeng Zhu, Yao Lu, Kurt Keutzer, Jianfei Chen, Song Han
cs.AI

Abstract

FP8 training has emerged as a promising method for improving training efficiency. Existing frameworks accelerate training by applying FP8 computation to linear layers while leaving optimizer states and activations in higher precision, which fails to fully optimize memory usage. This paper introduces COAT (Compressing Optimizer States and Activations for FP8 Training), a novel FP8 training framework designed to significantly reduce memory footprint when training large models. COAT addresses current limitations through two key innovations: (1) Dynamic Range Expansion, which aligns optimizer state distributions more closely with the FP8 representation range, thereby reducing quantization error, and (2) Mixed-Granularity Activation Quantization, which optimizes activation memory using a combination of per-tensor and per-group quantization strategies. Experiments demonstrate that COAT effectively reduces end-to-end training memory footprint by 1.54x compared to BF16 while achieving nearly lossless performance across various tasks, such as Large Language Model pretraining and fine-tuning and Vision Language Model training. COAT also achieves a 1.43x end-to-end training speedup compared to BF16, performing on par with or surpassing TransformerEngine's speedup. COAT enables efficient full-parameter training of large models on fewer GPUs, and facilitates doubling the batch size in distributed training settings, providing a practical solution for scaling large-scale model training. The code is available at https://github.com/NVlabs/COAT.
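
To make the abstract's two ideas concrete, below is a minimal PyTorch sketch that simulates FP8 E4M3 quantization: it contrasts per-tensor with per-group scaling (the two granularities combined by Mixed-Granularity Activation Quantization) and applies a power-based expansion before quantizing, in the spirit of Dynamic Range Expansion. This is an illustration under assumptions, not the COAT implementation: the function names, the exponent `k`, the group size of 128, and the use of `torch.float8_e4m3fn` in place of fused FP8 kernels are choices made here for clarity.

```python
# Hedged illustration only (not the authors' code): simulated FP8 E4M3 quantization
# comparing per-tensor and per-group scaling, plus a power-based "dynamic range
# expansion" applied before quantization. Exponent k and group size are assumptions.
import torch

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def quantize_fp8_per_tensor(x: torch.Tensor) -> torch.Tensor:
    """Quantize with a single scale for the whole tensor, then dequantize."""
    scale = x.abs().max().clamp(min=1e-12) / FP8_E4M3_MAX
    q = (x / scale).to(torch.float8_e4m3fn)  # simulated FP8 cast
    return q.to(torch.float32) * scale


def quantize_fp8_per_group(x: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    """Quantize each contiguous group of `group_size` elements with its own scale."""
    flat = x.reshape(-1, group_size)
    scale = flat.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / FP8_E4M3_MAX
    q = (flat / scale).to(torch.float8_e4m3fn)
    return (q.to(torch.float32) * scale).reshape_as(x)


def quantize_with_range_expansion(x: torch.Tensor, k: float = 2.0,
                                  group_size: int = 128) -> torch.Tensor:
    """Illustrative range expansion: map x -> sign(x)*|x|^k (k > 1) so the values
    spread over more of the FP8 range, quantize, then invert the mapping."""
    expanded = x.sign() * x.abs().pow(k)
    deq = quantize_fp8_per_group(expanded, group_size)
    return deq.sign() * deq.abs().pow(1.0 / k)


if __name__ == "__main__":
    torch.manual_seed(0)
    # Optimizer-state-like values: small magnitudes with a narrow dynamic range.
    x = torch.rand(4096) * 1e-3 + 1e-4
    for name, deq in [
        ("per-tensor", quantize_fp8_per_tensor(x)),
        ("per-group", quantize_fp8_per_group(x)),
        ("per-group + range expansion", quantize_with_range_expansion(x)),
    ]:
        rel_err = (deq - x).abs().mean() / x.abs().mean()
        print(f"{name:>28s}: mean relative error = {rel_err:.3e}")
```

Per-group scales bound the error contributed by local outliers at the cost of storing one scale per group, while a single per-tensor scale is cheaper but more outlier-sensitive; the expansion step spreads a narrow distribution over more of the FP8 representation range before casting, which is the intuition the abstract gives for reducing optimizer-state quantization error.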
