Identifying Sensitive Weights via Post-quantization Integral

Abstract

Serving large language models (LLMs) is costly. Post-training weight quantization addresses this by compressing models to fit limited memory and by reducing memory bandwidth to accelerate inference. Because not all weight dimensions are equally important, these methods typically rely on a sensitivity metric, which indicates the element-wise influence of weights on the loss function and is used to preprocess the original weights for better quantization. In this work, we conduct an empirical study of the accuracy of such sensitivity metrics and find that existing gradient- and Hessian-based metrics are highly inaccurate: they underestimate quantization's impact on the loss function by orders of magnitude, mainly because the local second-order approximation (i.e., the gradient and Hessian terms in Taylor's formula) has a small convergence radius. To tackle this problem, we propose Post-quantization Integral (PQI), an accurate metric that estimates posterior sensitivity in a fine-grained manner. To leverage this metric, we further propose ReQuant, a simple yet powerful framework that mainly consists of two Dense-and-Sparse detach components: self-adaptive outlier selection and step-wise significant weights detach. Results show that ReQuant boosts state-of-the-art post-training quantization methods, with a notable perplexity improvement of 2.66 on Llama 3.2 1B with QTIP.
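The gap between a local Taylor estimate and the actual loss change can be illustrated on a toy problem. The sketch below is not the paper's implementation: the abstract does not give the PQI formula, so the code assumes it can be read as the gradient integrated along the straight line from the original weights to the quantized weights. The quartic loss, the grid step of 0.5, and all function names are hypothetical choices made only for illustration.

```python
# Toy comparison of a second-order Taylor estimate with a path integral of
# the gradient. The loss, grid step, and helper names are all hypothetical.

def loss(w):
    # Separable quartic loss, chosen because it is clearly non-quadratic.
    return sum(x ** 4 for x in w)

def grad(w):
    return [4 * x ** 3 for x in w]

def hess_diag(w):
    # The loss is separable, so the Hessian is diagonal.
    return [12 * x ** 2 for x in w]

def quantize(w, step=0.5):
    # Round-to-nearest on a uniform grid (assumed quantizer).
    return [round(x / step) * step for x in w]

def taylor_estimate(w, wq):
    # Gradient term plus Hessian term, both evaluated at the original weights.
    d = [q - x for x, q in zip(w, wq)]
    g, h = grad(w), hess_diag(w)
    return sum(gi * di + 0.5 * hi * di * di for gi, hi, di in zip(g, h, d))

def path_integral_sensitivity(w, wq, n_steps=64):
    # Midpoint-rule average of the gradient along the segment w -> wq,
    # multiplied element-wise by the weight change.
    d = [q - x for x, q in zip(w, wq)]
    avg_grad = [0.0] * len(w)
    for k in range(n_steps):
        t = (k + 0.5) / n_steps
        point = [x + t * di for x, di in zip(w, d)]
        for i, gi in enumerate(grad(point)):
            avg_grad[i] += gi / n_steps
    return [a * di for a, di in zip(avg_grad, d)]

w = [0.1, -0.2, 0.3]
wq = quantize(w)
true_delta = loss(wq) - loss(w)
taylor = taylor_estimate(w, wq)
per_weight = path_integral_sensitivity(w, wq)

print("true loss change:     ", true_delta)
print("Taylor estimate:      ", taylor)
print("path-integral total:  ", sum(per_weight))
print("per-weight breakdown: ", per_weight)
```

In this toy setting the per-weight path-integral terms sum to the true loss change up to numerical integration error, while the Taylor estimate falls short of it. The shortfall here is small; the example shows only the direction of the error, not the orders-of-magnitude gap reported for real LLMs.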
