QuEST: Stable Training of LLMs with 1-Bit Weights and Activations
February 7, 2025
Authors: Andrei Panferov, Jiale Chen, Soroush Tabesh, Roberto L. Castro, Mahdi Nikdan, Dan Alistarh
cs.AI
Abstract
One approach to reducing the massive costs of large language models (LLMs) is
the use of quantized or sparse representations for training or deployment.
While post-training compression methods are very popular, the question of
obtaining even more accurate compressed models by directly training over such
representations, i.e., Quantization-Aware Training (QAT), is still open: for
example, a recent study (arXiv:2411.04330v2) put the "optimal" bit-width at
which models can be trained using QAT, while staying accuracy-competitive with
standard FP16/BF16 precision, at 8-bit weights and activations.
We advance this state of the art via a new method called QuEST, which is
Pareto-competitive with FP16, i.e., it provides better accuracy at a lower model
size, while training models with weights and activations in 4 bits or less.
Moreover, QuEST allows stable training with 1-bit weights and activations.
QuEST achieves this by improving two key aspects of QAT methods: (1) accurate
and fast quantization of the (continuous) distributions of weights and
activations via Hadamard normalization and MSE-optimal fitting; (2) a new trust
gradient estimator based on the idea of explicitly minimizing the error between
the noisy gradient computed over quantized states and the "true" (but unknown)
full-precision gradient. Experiments on Llama-type architectures show that
QuEST induces stable scaling laws across the entire range of hardware-supported
precisions, and can be extended to sparse representations. We provide GPU
kernel support showing that models produced by QuEST can be executed
efficiently. Our code is available at https://github.com/IST-DASLab/QuEST.Summary