Matryoshka Quantization

Abstract

Quantizing model weights is critical for reducing the communication and inference costs of large models. However, quantizing models, especially to low precisions such as int4 or int2, requires a trade-off in model quality; int2, in particular, is known to severely degrade model quality. Consequently, practitioners are often forced either to maintain multiple models at different quantization levels or to serve a single model that best satisfies the quality-latency trade-off. On the other hand, integer data types such as int8 inherently possess a nested (Matryoshka) structure: smaller bit-width integers, such as int4 or int2, are nested within the most significant bits. This paper proposes Matryoshka Quantization (MatQuant), a novel multi-scale quantization technique that removes the need for multiple quantized models. With MatQuant, only one model has to be trained and maintained, and that model can then be served at different precision levels. Furthermore, owing to the co-training and co-distillation regularization that MatQuant provides, the int2 models extracted from it can be up to 10% more accurate than those produced by standard int2 quantization (using techniques such as QAT or OmniQuant). This represents significant progress in model quantization, as demonstrated by the fact that, with the same recipe, an int2 FFN-quantized Gemma-2 9B model is more accurate than an int8 FFN-quantized Gemma-2 2B model.
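The nested structure of integer types mentioned above can be illustrated with a minimal sketch. It assumes unsigned 8-bit codes and extracts the lower-precision codes by plain truncation (a right shift); the helper names `slice_msb` and `to_8bit_grid` are hypothetical, and the sketch does not reproduce MatQuant's training procedure or any rounding or clamping rule it may apply when slicing.

```python
import numpy as np

def slice_msb(codes_u8: np.ndarray, bits: int) -> np.ndarray:
    """Keep the `bits` most significant bits of unsigned 8-bit codes.

    Assumption: plain truncation via right shift, used here only to
    show the nesting of int4/int2 inside int8.
    """
    shift = 8 - bits
    return (codes_u8.astype(np.uint8) >> shift).astype(np.uint8)

def to_8bit_grid(sliced: np.ndarray, bits: int) -> np.ndarray:
    """Place sliced codes back on the 8-bit grid (low bits zeroed)."""
    shift = 8 - bits
    return (sliced.astype(np.uint8) << shift).astype(np.uint8)

# Hypothetical 8-bit quantized weights.
codes = np.array([0, 37, 128, 200, 255], dtype=np.uint8)

int4_codes = slice_msb(codes, 4)   # top 4 bits of each code
int2_codes = slice_msb(codes, 2)   # top 2 bits of each code

# The int2 codes are themselves the top 2 bits of the int4 codes,
# which is the "Matryoshka" nesting.
nested_ok = np.array_equal(int2_codes, int4_codes >> 2)
```

Because every lower-precision code is a prefix of the 8-bit code, a single stored int8 tensor is enough to serve the int4 and int2 variants in this sketch; how accurate those variants are depends on how the model was trained, which is what the abstract's co-training and co-distillation address.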