COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training

Abstract

FP8 training has emerged as a promising method for improving training efficiency. Existing frameworks accelerate training by applying FP8 computation to linear layers, but they keep optimizer states and activations in higher precision and therefore do not fully optimize memory usage. This paper introduces COAT (Compressing Optimizer States and Activations for FP8 Training), a novel FP8 training framework designed to significantly reduce the memory footprint of training large models. COAT addresses current limitations through two key innovations: (1) Dynamic Range Expansion, which aligns optimizer state distributions more closely with the FP8 representation range and thereby reduces quantization error, and (2) Mixed-Granularity Activation Quantization, which optimizes activation memory by combining per-tensor and per-group quantization strategies. Experiments demonstrate that COAT reduces the end-to-end training memory footprint by 1.54x compared to BF16 while achieving nearly lossless performance across tasks including Large Language Model pretraining and fine-tuning and Vision Language Model training. COAT also achieves a 1.43x end-to-end training speedup over BF16, matching or surpassing TransformerEngine's speedup. COAT enables efficient full-parameter training of large models on fewer GPUs and facilitates doubling the batch size in distributed training settings, providing a practical solution for scaling up large-model training. The code is available at https://github.com/NVlabs/COAT.
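To make the per-tensor versus per-group distinction concrete, the following is a minimal NumPy sketch. It is not COAT's implementation: the abstract does not state the FP8 format, the group size, or how the two granularities are combined, so the E4M3 format, the group size of 16, the synthetic data, and the helper names `round_to_e4m3`, `fake_quantize`, and `mean_relative_error` are all assumptions chosen for illustration. FP8 rounding is simulated in float64 rather than run on real FP8 hardware.

```python
import numpy as np

# Assumed format: FP8 E4M3 (3 mantissa bits, largest finite value 448).
FP8_E4M3_MAX = 448.0
MIN_NORMAL_EXP = -6   # exponent of the smallest normal E4M3 value
MANTISSA_BITS = 3


def round_to_e4m3(x):
    """Round values to the nearest point on a simulated E4M3 grid (saturating at 448)."""
    x = np.asarray(x, dtype=np.float64)
    mag = np.minimum(np.abs(x), FP8_E4M3_MAX)
    safe = np.where(mag > 0, mag, 1.0)  # avoid log2(0); zeros still round to 0 below
    # Clamp the exponent so values below the normal range use subnormal spacing.
    exp = np.maximum(np.floor(np.log2(safe)), MIN_NORMAL_EXP)
    step = 2.0 ** (exp - MANTISSA_BITS)
    return np.sign(x) * np.round(mag / step) * step


def fake_quantize(x, group_size=None):
    """Quantize then dequantize x, with one scale per tensor or one per group."""
    x = np.asarray(x, dtype=np.float64)
    groups = x.reshape(1, -1) if group_size is None else x.reshape(-1, group_size)
    amax = np.abs(groups).max(axis=1, keepdims=True)
    scale = np.where(amax > 0, amax / FP8_E4M3_MAX, 1.0)
    return (round_to_e4m3(groups / scale) * scale).reshape(x.shape)


def mean_relative_error(x, x_hat):
    """Mean of |x - x_hat| / |x|; assumes x contains no zeros."""
    return float(np.mean(np.abs(x - x_hat) / np.abs(x)))


# Synthetic tensor (hypothetical data): seven small-magnitude groups and one large one.
rng = np.random.default_rng(0)
small = rng.uniform(1e-5, 2e-5, size=(7, 16))
large = rng.uniform(50.0, 100.0, size=(1, 16))
x = np.concatenate([small, large]).reshape(-1)

per_tensor_err = mean_relative_error(x, fake_quantize(x))
per_group_err = mean_relative_error(x, fake_quantize(x, group_size=16))
print(f"per-tensor: {per_tensor_err:.3f}, per-group: {per_group_err:.3f}")
```

With a single per-tensor scale, the one large group sets the scale and the small groups underflow to zero, whereas per-group scales keep every group inside the representable range at the cost of storing more scale factors. That trade-off between accuracy and overhead is the general motivation for mixing granularities.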
