Optimizing Large Language Model Training Using FP4 Quantization

Abstract

The growing computational demands of training large language models (LLMs) necessitate more efficient methods. Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce these costs. While FP8 precision has demonstrated feasibility, leveraging FP4 remains a challenge due to significant quantization errors and limited representational capacity. This work introduces the first FP4 training framework for LLMs, addressing these challenges with two key innovations: a differentiable quantization estimator for precise weight updates and an outlier clamping and compensation strategy to prevent activation collapse. To ensure stability, the framework integrates a mixed-precision training scheme and vector-wise quantization. Experimental results demonstrate that our FP4 framework achieves accuracy comparable to BF16 and FP8, with minimal degradation, scaling effectively to 13B-parameter LLMs trained on up to 100B tokens. With the emergence of next-generation hardware supporting FP4, our framework sets a foundation for efficient ultra-low precision training.
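To make two of the ideas named in the abstract more concrete, the following is a minimal NumPy sketch of simulated vector-wise FP4 quantization combined with outlier clamping and compensation. It is an illustration under stated assumptions, not the authors' implementation: the E2M1 value grid, the per-row absmax scaling, the 0.99 quantile threshold, and the function names `quantize_fp4_rowwise` and `clamp_and_compensate` are all hypothetical choices made here. The differentiable quantization estimator is not covered, since the abstract gives no detail on its form.

```python
import numpy as np

# Non-negative magnitudes representable in an E2M1 4-bit float format.
# Assumption: the abstract does not say which FP4 format is used.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def quantize_fp4_rowwise(x):
    """Simulate vector-wise (per-row) absmax FP4 quantization.

    Each row is scaled so its largest magnitude maps to the largest grid
    value, rounded to the nearest grid point, then scaled back.
    """
    absmax = np.max(np.abs(x), axis=1, keepdims=True)
    scale = np.where(absmax > 0, absmax / FP4_GRID[-1], 1.0)
    scaled = x / scale
    # Index of the nearest grid magnitude for every element.
    idx = np.argmin(np.abs(np.abs(scaled)[..., None] - FP4_GRID), axis=-1)
    return np.sign(scaled) * FP4_GRID[idx] * scale


def clamp_and_compensate(x, q=0.99):
    """Clamp values outside the [1 - q, q] quantile range.

    Returns the clamped tensor and a residual that is nonzero only where
    clamping changed a value, so the outliers can be kept at higher
    precision and added back after quantization.
    """
    lo, hi = np.quantile(x, [1.0 - q, q])
    clamped = np.clip(x, lo, hi)
    residual = x - clamped
    return clamped, residual


# Usage: activations with one large outlier in the first row.
rng = np.random.default_rng(0)
acts = rng.standard_normal((4, 256))
acts[0, 0] = 50.0

plain = quantize_fp4_rowwise(acts)
clamped, residual = clamp_and_compensate(acts)
compensated = quantize_fp4_rowwise(clamped) + residual

err_plain = np.mean(np.abs(acts - plain))
err_comp = np.mean(np.abs(acts - compensated))
```

In this toy setting the single outlier inflates the scale of its row, so direct quantization rounds most of that row's values to zero or to the smallest nonzero level. Clamping first keeps the scale small, and the residual restores the outlier exactly.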
