

Matryoshka Quantization

February 10, 2025
Authors: Pranav Nair, Puranjay Datta, Jeff Dean, Prateek Jain, Aditya Kusupati
cs.AI

Abstract

Quantizing model weights is critical for reducing the communication and inference costs of large models. However, quantizing models -- especially to low precisions like int4 or int2 -- requires a trade-off in model quality; int2, in particular, is known to severely degrade model quality. Consequently, practitioners are often forced to maintain multiple models with different quantization levels or serve a single model that best satisfies the quality-latency trade-off. On the other hand, integer data types, such as int8, inherently possess a nested (Matryoshka) structure where smaller bit-width integers, like int4 or int2, are nested within the most significant bits. This paper proposes Matryoshka Quantization (MatQuant), a novel multi-scale quantization technique that addresses the challenge of needing multiple quantized models. It allows training and maintaining just one model, which can then be served at different precision levels. Furthermore, due to the co-training and co-distillation regularization provided by MatQuant, the int2 precision models extracted by MatQuant can be up to 10% more accurate than standard int2 quantization (using techniques like QAT or OmniQuant). This represents significant progress in model quantization, demonstrated by the fact that, with the same recipe, an int2 FFN-quantized Gemma-2 9B model is more accurate than an int8 FFN-quantized Gemma-2 2B model.
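The nested (Matryoshka) structure of integer types that the abstract relies on can be made concrete: slicing off the most significant bits of a signed int8 weight yields a valid int4 or int2 weight that shares the same scale. The sketch below is only an illustration of that bit-nesting idea, not the paper's MatQuant training recipe; the function names and the per-tensor `scale` are illustrative assumptions.

```python
import numpy as np

def slice_msbs(w_int8: np.ndarray, bits: int) -> np.ndarray:
    """Keep the top `bits` most significant bits of a signed int8 weight.

    An arithmetic right shift preserves the sign, so the result is the
    lower-precision integer (e.g. int4 in [-8, 7], int2 in [-2, 1])
    nested inside the int8 representation.
    """
    return w_int8 >> (8 - bits)

def dequantize(w_sliced: np.ndarray, bits: int, scale: float) -> np.ndarray:
    # Shift back up to the int8 range before applying the scale, so every
    # precision level reuses the same quantization parameters.
    return (w_sliced.astype(np.int32) << (8 - bits)) * scale

w = np.array([-128, -61, 0, 37, 127], dtype=np.int8)
w4 = slice_msbs(w, 4)  # nested int4 view of the same weights
w2 = slice_msbs(w, 2)  # nested int2 view of the same weights
```

This is what lets a single served model answer requests at int8, int4, or int2: the lower-precision weights are simply prefixes of the higher-precision ones, rather than separately stored tensors.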

