QuEST: Stable Training of LLMs with 1-Bit Weights and Activations

Abstract

One approach to reducing the massive costs of large language models (LLMs) is the use of quantized or sparse representations for training or deployment. While post-training compression methods are very popular, the question of obtaining even more accurate compressed models by directly training over such representations, i.e., Quantization-Aware Training (QAT), is still open. For example, a recent study (arXiv:2411.04330v2) put the "optimal" bit-width at which models can be trained using QAT, while staying accuracy-competitive with standard FP16/BF16 precision, at 8-bit weights and activations. We advance this state of the art via a new method called QuEST, which is Pareto-competitive with FP16, i.e., it provides better accuracy at lower model size, while training models with weights and activations in 4 bits or less. Moreover, QuEST allows stable training with 1-bit weights and activations. QuEST achieves this by improving two key aspects of QAT methods: (1) accurate and fast quantization of the (continuous) distributions of weights and activations via Hadamard normalization and MSE-optimal fitting; (2) a new trust gradient estimator based on the idea of explicitly minimizing the error between the noisy gradient computed over quantized states and the "true" (but unknown) full-precision gradient. Experiments on Llama-type architectures show that QuEST induces stable scaling laws across the entire range of hardware-supported precisions, and can be extended to sparse representations. We provide GPU kernel support showing that models produced by QuEST can be executed efficiently. Our code is available at https://github.com/IST-DASLab/QuEST.
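To make aspect (1) concrete, the following is a minimal sketch of Hadamard normalization followed by an MSE-fitted uniform quantizer. It is an illustration under stated assumptions, not the QuEST implementation from the repository above: the function names (`fwht`, `quantize`, `fit_scale`, `hadamard_quantize`) are hypothetical, the symmetric mid-rise grid is one possible choice of quantization grid, and a plain grid search over candidate scales stands in for whatever MSE-optimal fitting procedure the authors use. The trust gradient estimator of aspect (2) is not shown.

```python
import math
import random


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two).

    With the 1/sqrt(n) normalization the transform is its own inverse.
    """
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]


def quantize(values, scale, bits):
    """Round to a symmetric grid with 2**bits levels at (k + 0.5) * scale.

    Assumption: a mid-rise grid, so bits=1 yields the two levels +-scale/2.
    """
    half = 2 ** bits / 2 - 0.5  # largest grid index, e.g. 0.5 for 1 bit
    return [scale * max(-half, min(half, math.floor(v / scale) + 0.5))
            for v in values]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def fit_scale(values, bits, candidates=200):
    """Grid-search the scale that minimizes quantization MSE (illustrative)."""
    max_abs = max(abs(v) for v in values) or 1.0
    best_scale, best_err = None, float("inf")
    for k in range(1, candidates + 1):
        scale = 2 * max_abs * k / candidates
        err = mse(values, quantize(values, scale, bits))
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale, best_err


def hadamard_quantize(values, bits):
    """Rotate with the Hadamard transform, quantize, and rotate back."""
    rotated = fwht(values)
    scale, _ = fit_scale(rotated, bits)
    quantized = quantize(rotated, scale, bits)
    return fwht(quantized), scale


if __name__ == "__main__":
    random.seed(0)
    weights = [random.gauss(0.0, 1.0) for _ in range(64)]
    for bits in (1, 2, 4):
        approx, scale = hadamard_quantize(weights, bits)
        print(bits, "bits: scale", round(scale, 4),
              "mse", round(mse(weights, approx), 4))
```

Because the normalized Hadamard transform is orthogonal, the error measured in the rotated domain equals the error after rotating back, which is why the scale can be fitted directly on the rotated values in this sketch.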
