QuEST: Stable Training of LLMs with 1-Bit Weights and Activations
February 7, 2025
Authors: Andrei Panferov, Jiale Chen, Soroush Tabesh, Roberto L. Castro, Mahdi Nikdan, Dan Alistarh
cs.AI
Abstract
One approach to reducing the massive costs of large language models (LLMs) is
the use of quantized or sparse representations for training or deployment.
While post-training compression methods are very popular, the question of
obtaining even more accurate compressed models by directly training over such
representations, i.e., Quantization-Aware Training (QAT), is still open: for
example, a recent study (arXiv:2411.04330v2) put the "optimal" bit-width at
which models can be trained using QAT, while staying accuracy-competitive with
standard FP16/BF16 precision, at 8-bit weights and activations.
We advance this state of the art via a new method called QuEST, which is
Pareto-competitive with FP16, i.e., it provides better accuracy at a lower model
size, while training models with weights and activations in 4 bits or less.
Moreover, QuEST allows stable training with 1-bit weights and activations.
QuEST achieves this by improving two key aspects of QAT methods: (1) accurate
and fast quantization of the (continuous) distributions of weights and
activations via Hadamard normalization and MSE-optimal fitting; (2) a new trust
gradient estimator based on the idea of explicitly minimizing the error between
the noisy gradient computed over quantized states and the "true" (but unknown)
full-precision gradient. Experiments on Llama-type architectures show that
QuEST induces stable scaling laws across the entire range of hardware-supported
precisions, and can be extended to sparse representations. We provide GPU
kernel support showing that models produced by QuEST can be executed
efficiently. Our code is available at https://github.com/IST-DASLab/QuEST.Summary