MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design

December 19, 2024
Authors: Zhen Zheng, Xiaonan Song, Chuanjie Liu
cs.AI

Abstract

Quantization has become one of the most effective methodologies for compressing LLMs to a smaller size. However, existing quantization solutions still suffer from either a non-negligible accuracy drop or system inefficiency. In this paper, we make a comprehensive analysis of general quantization principles and their effect on the triangle of accuracy, memory consumption, and system efficiency. We propose MixLLM, which explores a new optimization space of mixed-precision quantization between output features, based on the insight that different output features matter differently to the model. MixLLM identifies the output features with high salience in a global view rather than within each single layer, effectively assigning larger bit-widths to the output features that need them most, achieving good accuracy with low memory consumption. We present the sweet spot of quantization configuration from algorithm-system co-design that leads to both high accuracy and system efficiency. To address the system challenge, we design a two-step dequantization that makes it easy to exploit int8 Tensor Cores with fast data type conversion, significantly reducing dequantization overhead, and we present a software pipeline that optimally overlaps memory access, dequantization, and the MatMul. Extensive experiments show that with only 10% more bits, the PPL increase can be reduced from about 0.5 in the SOTA to within 0.2 for Llama 3.1 70B, while the average MMLU-Pro score improves by 0.93 over the SOTA across three popular models. In addition to its superior accuracy, MixLLM also achieves state-of-the-art system efficiency.
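To make the global (rather than per-layer) salience idea concrete, here is a minimal sketch of how a global bit-width assignment could work. It is not MixLLM's actual salience metric or implementation: the function name, the placeholder salience scores, and the 4-bit/8-bit split are illustrative assumptions; only the "rank all output features across all layers in one pool, then spend the extra bit budget on the top of that ranking" structure follows the abstract.

```python
import numpy as np

def assign_bitwidths(salience_per_layer, extra_bit_budget=0.10,
                     high_bits=8, low_bits=4):
    """salience_per_layer maps layer name -> 1-D array of per-output-feature
    salience scores (how these are computed is MixLLM's contribution and is
    not reproduced here). Returns layer name -> per-feature bit-widths."""
    # Flatten all (layer, feature) pairs into ONE global ranking, instead of
    # ranking salience within each layer separately.
    entries = [(name, i, float(s))
               for name, scores in salience_per_layer.items()
               for i, s in enumerate(scores)]
    entries.sort(key=lambda e: e[2], reverse=True)

    total = len(entries)
    # Upgrading one feature from low_bits to high_bits costs
    # (high_bits - low_bits) extra bits per weight, so a budget of 10% more
    # bits over an all-low_bits baseline upgrades
    # total * 0.10 * low_bits / (high_bits - low_bits) features
    # (with 4 -> 8 bits, that is 10% of all output features).
    n_high = int(total * extra_bit_budget * low_bits / (high_bits - low_bits))

    bitwidths = {name: np.full(len(scores), low_bits, dtype=np.int64)
                 for name, scores in salience_per_layer.items()}
    for name, i, _ in entries[:n_high]:
        bitwidths[name][i] = high_bits
    return bitwidths

# Usage with random placeholder scores (layer names are hypothetical):
rng = np.random.default_rng(0)
salience = {"q_proj": rng.random(4096), "o_proj": rng.random(4096)}
bits = assign_bitwidths(salience)
```

Note that a globally salient layer can end up with many more 8-bit features than a less salient one, which is exactly what a fixed per-layer quota cannot express.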
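The two-step dequantization can likewise be sketched in NumPy as a stand-in for the fused GPU kernel. This is an assumption-laden illustration, not the paper's kernel: the function signature is invented, and the per-output-feature weight scales are a simplification (the real design may use finer, group-wise granularity). The point it shows is the split the abstract describes: a cheap integer-only widening so the MatMul runs on int8 Tensor Cores, followed by a single float rescale of the accumulator instead of per-weight float dequantization.

```python
import numpy as np

def two_step_matmul(x_int8, x_scale, w_int4, w_zero, w_scale):
    """x_int8: (M, K) int8 activations; x_scale: (M, 1) per-token float scales.
    w_int4: (K, N) weights stored as small non-negative ints; w_zero: (N,)
    integer zero-points; w_scale: (N,) per-output-feature float scales."""
    # Step 1: integer-only widening, int4 -> int8 (a fast data type
    # conversion, no float math), so the MatMul can use int8 Tensor Cores.
    w_int8 = (w_int4.astype(np.int16) - w_zero).astype(np.int8)
    # int8 x int8 -> int32 accumulation, as an int8 Tensor Core MMA would do.
    acc_int32 = x_int8.astype(np.int32) @ w_int8.astype(np.int32)
    # Step 2: one float rescale of the int32 accumulator replaces
    # dequantizing every individual weight to float before the MatMul.
    return acc_int32.astype(np.float32) * x_scale * w_scale
```

In the real system these steps are fused inside the kernel and, per the abstract, software-pipelined so that memory access, dequantization, and the MatMul overlap rather than run back to back.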
