VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models
September 25, 2024
Authors: Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, Mao Yang
cs.AI
Abstract
Scaling model size significantly challenges the deployment and inference of
Large Language Models (LLMs). Due to the redundancy in LLM weights, recent
research has focused on pushing weight-only quantization to extremely low-bit
(even down to 2 bits). It reduces memory requirements, optimizes storage costs,
and decreases memory bandwidth needs during inference. However, due to
numerical representation limitations, traditional scalar-based weight
quantization struggles to achieve such extremely low bit-widths. Recent research on
Vector Quantization (VQ) for LLMs has demonstrated the potential for extremely
low-bit model quantization by compressing vectors into indices using lookup
tables.
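
As a rough illustration of the lookup-table idea described above (a minimal sketch, not VPTQ's algorithm: the function names, vector length, codebook size, and the plain k-means codebook fit are illustrative choices, not the paper's), a weight matrix can be split into short vectors, each vector replaced by the index of its nearest codebook entry, and reconstructed later by a table lookup:

import numpy as np

def vq_quantize(W, vector_len=8, num_centroids=256, iters=10, seed=0):
    # Toy vector quantization: fit a codebook with plain k-means, then store
    # one index per sub-vector. uint8 indices assume at most 256 centroids.
    rng = np.random.default_rng(seed)
    vectors = W.reshape(-1, vector_len)                      # group weights into short vectors
    codebook = vectors[rng.choice(len(vectors), num_centroids, replace=False)].copy()
    for _ in range(iters):                                   # k-means refinement of the codebook
        dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(axis=1)                           # nearest centroid per vector
        for c in range(num_centroids):
            members = vectors[idx == c]
            if len(members):
                codebook[c] = members.mean(axis=0)
    return idx.astype(np.uint8), codebook                    # indices + lookup table

def vq_dequantize(idx, codebook, shape):
    # Reconstruct the weight matrix by looking the indices up in the codebook.
    return codebook[idx].reshape(shape)

W = np.random.randn(128, 128).astype(np.float32)
indices, codebook = vq_quantize(W)
W_hat = vq_dequantize(indices, codebook, W.shape)
print("reconstruction MSE:", float(((W - W_hat) ** 2).mean()))

With 256 centroids over 8-element vectors, each weight costs roughly one index bit plus a share of the small codebook, which is the storage regime that extremely low-bit VQ targets.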
In this paper, we introduce Vector Post-Training Quantization (VPTQ) for
extremely low-bit quantization of LLMs. We use Second-Order Optimization to
formulate the LLM VQ problem and guide our quantization algorithm design by
solving the optimization. We further refine the weights using
Channel-Independent Second-Order Optimization for a granular VQ. In addition,
by decomposing the optimization problem, we propose a brief and effective
codebook initialization algorithm. We also extend VPTQ to support residual and
outlier quantization, which enhances model accuracy and further compresses the
model. Our experimental results show that VPTQ reduces model quantization
perplexity by 0.01-0.34 on LLaMA-2, 0.38-0.68 on Mistral-7B,
4.41-7.34 on LLaMA-3 over SOTA at 2-bit, with an average accuracy
improvement of 0.79-1.5% on LLaMA-2, 1% on Mistral-7B, and 11-22% on
LLaMA-3 on QA tasks. We use only 10.4-18.6% of the quantization
algorithm execution time of SOTA methods, resulting in a 1.6-1.8×
increase in inference throughput compared to SOTA.
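
For context on the "Second-Order Optimization" formulation mentioned above, the layer-wise proxy objective commonly used in this line of post-training quantization work (stated here as background, not as VPTQ's exact derivation) is

\[
\min_{\hat{W}} \; \lVert W X - \hat{W} X \rVert_F^2
  \;=\; \operatorname{tr}\!\bigl( (W - \hat{W})\, H \,(W - \hat{W})^{\top} \bigr),
\qquad H = X X^{\top},
\]

where \(W\) is the original weight matrix, \(\hat{W}\) is its quantized counterpart constrained to codebook entries, \(X\) collects calibration activations, and \(H\) is (up to a constant factor) the Hessian of the layer-wise reconstruction error. Second-order methods use \(H\) to decide how to compensate the remaining weights for the error introduced by each quantization step.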