MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design

Abstract

Quantization has become one of the most effective methods for compressing LLMs to a smaller size. However, existing quantization solutions still suffer from either a non-negligible accuracy drop or system inefficiency. In this paper, we comprehensively analyze how general quantization principles affect the triangle of accuracy, memory consumption, and system efficiency. We propose MixLLM, which explores the new optimization space of mixed-precision quantization between output features, based on the insight that different output features matter differently in the model. MixLLM identifies the output features with high salience in the global view rather than within each single layer, assigning the larger bit-width to the output features that need it most and thereby achieving good accuracy with low memory consumption. We present the sweet spot of quantization configuration under algorithm-system co-design that leads to both high accuracy and high system efficiency. To address the system challenge, we design a two-step dequantization that makes it easy to use the int8 Tensor Core, together with a fast data type conversion that significantly reduces dequantization overhead, and we present a software pipeline that overlaps memory access, dequantization, and the MatMul as fully as possible. Extensive experiments show that, with only 10% more bits, the perplexity (PPL) increase can be reduced from about 0.5 in SOTA to within 0.2 for Llama 3.1 70B, while the average MMLU-Pro score improves by 0.93 over the SOTA across three popular models. In addition to its superior accuracy, MixLLM also achieves state-of-the-art system efficiency.
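The idea of ranking output features globally rather than per layer can be illustrated with a minimal sketch. It is not the authors' implementation: the salience scores are random placeholders (the abstract does not say how salience is computed), the function names `select_high_precision_features` and `average_bits` are hypothetical, and the 4-bit and 8-bit widths are assumptions chosen so that promoting 10% of the features yields roughly 10% more bits on average.

```python
import numpy as np

def select_high_precision_features(salience_per_layer, fraction):
    """Return one boolean mask per layer; True marks an output feature
    that gets the larger bit-width. Ranking is global across all layers."""
    all_scores = np.concatenate(salience_per_layer)
    k = int(round(fraction * all_scores.size))
    flat_mask = np.zeros(all_scores.size, dtype=bool)
    if k > 0:
        # Indices of the k largest scores over the whole model.
        top = np.argpartition(all_scores, -k)[-k:]
        flat_mask[top] = True
    # Split the flat mask back into per-layer masks.
    boundaries = np.cumsum([s.size for s in salience_per_layer])[:-1]
    return np.split(flat_mask, boundaries)

def average_bits(masks, low_bits=4, high_bits=8):
    """Average weight bit-width implied by the masks (assumed 4/8-bit split)."""
    total = sum(m.size for m in masks)
    high = sum(int(m.sum()) for m in masks)
    return (high * high_bits + (total - high) * low_bits) / total

# Placeholder salience scores for three layers; the second layer is
# deliberately scaled up so that it dominates the global ranking.
rng = np.random.default_rng(0)
salience = [rng.random(100), rng.random(100) * 5.0, rng.random(200)]
masks = select_high_precision_features(salience, fraction=0.10)
print([int(m.sum()) for m in masks], average_bits(masks))
```

With a per-layer rule, every layer would promote the same 10% of its features. The global rule lets a layer with consistently higher scores receive most of the high-precision budget, while the overall average stays at 4.4 bits under the assumed 4/8-bit split.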
