QuEST: Stable Training of LLMs with 1-Bit Weights and Activations

February 7, 2025
Authors: Andrei Panferov, Jiale Chen, Soroush Tabesh, Roberto L. Castro, Mahdi Nikdan, Dan Alistarh
cs.AI

Abstract

One approach to reducing the massive costs of large language models (LLMs) is the use of quantized or sparse representations for training or deployment. While post-training compression methods are very popular, the question of obtaining even more accurate compressed models by directly training over such representations, i.e., Quantization-Aware Training (QAT), is still open: for example, a recent study (arXiv:2411.04330v2) put the "optimal" bit-width at which models can be trained using QAT, while staying accuracy-competitive with standard FP16/BF16 precision, at 8-bit weights and activations. We advance this state-of-the-art via a new method called QuEST, which is Pareto-competitive with FP16, i.e., it provides better accuracy at lower model size, while training models with weights and activations in 4 bits or less. Moreover, QuEST allows stable training with 1-bit weights and activations. QuEST achieves this by improving two key aspects of QAT methods: (1) accurate and fast quantization of the (continuous) distributions of weights and activations via Hadamard normalization and MSE-optimal fitting; (2) a new trust gradient estimator based on the idea of explicitly minimizing the error between the noisy gradient computed over quantized states and the "true" (but unknown) full-precision gradient. Experiments on Llama-type architectures show that QuEST induces stable scaling laws across the entire range of hardware-supported precisions, and can be extended to sparse representations. We provide GPU kernel support showing that models produced by QuEST can be executed efficiently. Our code is available at https://github.com/IST-DASLab/QuEST.
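
To make the two ingredients concrete, below is a minimal PyTorch sketch of the pipeline the abstract describes: a Hadamard rotation to normalize the weight/activation distribution, a grid search for an MSE-optimal quantization scale, and a backward pass that only propagates gradients through coordinates with small quantization error, as a simplified stand-in for the trust gradient estimator. All names (`fwht`, `mse_optimal_scale`, `TrustQuant`), the grid-search range, and the trust threshold are illustrative assumptions, not the released QuEST implementation; see the repository linked above for the authors' code.

```python
# Illustrative sketch only: Hadamard normalization + MSE-optimal fitting
# in the forward pass, and a "trusted-coordinate" gradient mask in the
# backward pass. Not the authors' implementation.
import torch


def fwht(x: torch.Tensor) -> torch.Tensor:
    """Orthonormal fast Walsh-Hadamard transform along the last dim.

    Assumes the last dimension is a power of two. The rotation spreads
    outliers, so the transformed values are easier to fit with a uniform
    quantization grid."""
    d = x.shape[-1]
    y, h = x.clone(), 1
    while h < d:
        y = y.view(*x.shape[:-1], d // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        y = torch.stack((a + b, a - b), dim=-2).reshape(*x.shape)
        h *= 2
    return y / d ** 0.5


def mse_optimal_scale(x: torch.Tensor, bits: int, grid: int = 64) -> torch.Tensor:
    """Grid-search the clipping scale that minimizes quantization MSE for
    symmetric uniform quantization (bits >= 2; the 1-bit case reduces to
    sign(x) with scale mean(|x|))."""
    qmax = 2 ** (bits - 1) - 1
    best_scale, best_err = None, float("inf")
    for frac in torch.linspace(0.2, 1.0, grid):
        scale = frac * x.abs().max() / qmax
        q = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
        err = (q - x).pow(2).mean().item()
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale


class TrustQuant(torch.autograd.Function):
    """Fake-quantize in the forward pass; in the backward pass, propagate
    the gradient only through coordinates whose quantization error is
    small ("trusted")."""

    @staticmethod
    def forward(ctx, x, bits):
        xh = fwht(x)                         # Hadamard normalization
        scale = mse_optimal_scale(xh, bits)  # MSE-optimal fit
        qmax = 2 ** (bits - 1) - 1
        q = torch.clamp(torch.round(xh / scale), -qmax, qmax) * scale
        ctx.save_for_backward((q - xh).abs(), scale)
        return fwht(q)                       # orthonormal H is its own inverse

    @staticmethod
    def backward(ctx, grad_out):
        err, scale = ctx.saved_tensors
        trust = (err <= 0.5 * scale).to(grad_out.dtype)  # keep low-error coords
        return fwht(fwht(grad_out) * trust), None


if __name__ == "__main__":
    w = torch.randn(4, 256, requires_grad=True)
    TrustQuant.apply(w, 4).sum().backward()
    print(w.grad.shape)  # gradients flow only through "trusted" coordinates
```

Because the orthonormal Sylvester Hadamard transform is its own inverse, the same `fwht` call both rotates into the quantization basis and rotates back, which is why it also appears twice in the backward pass.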
