MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design

Abstract

Quantization has become one of the most effective methods for compressing LLMs to a smaller size. However, existing quantization solutions still suffer from either a non-negligible accuracy drop or system inefficiency. In this paper, we comprehensively analyze how general quantization principles affect the triangle of accuracy, memory consumption, and system efficiency. We propose MixLLM, which explores the new optimization space of mixed-precision quantization between output features, based on the insight that different output features matter differently in the model. MixLLM identifies the output features with high salience from a global view rather than within each single layer, effectively assigning a larger bit-width to the output features that need it most, to achieve good accuracy with low memory consumption. We present the sweet spot of quantization configuration in algorithm-system co-design that leads to both high accuracy and high system efficiency. To address the system challenge, we design a two-step dequantization that makes easy use of the int8 Tensor Core, together with fast data type conversion that significantly reduces dequantization overhead, and we present a software pipeline that overlaps memory access, dequantization, and the MatMul as much as possible. Extensive experiments show that with only 10% more bits, the PPL increase can be reduced from about 0.5 in SOTA to within 0.2 for Llama 3.1 70B, while MMLU-Pro improves by 0.93 on average over the SOTA on three popular models. In addition to its superior accuracy, MixLLM also achieves state-of-the-art system efficiency.
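The global selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how salience is computed, so the scores here are taken as given, and the names `assign_bit_widths` and `average_bits`, the layer names, and the 4-bit and 8-bit widths are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def assign_bit_widths(
    salience: Dict[str, List[float]],
    high_fraction: float = 0.10,
    low_bits: int = 4,   # assumed base width, not stated in the abstract
    high_bits: int = 8,  # assumed larger width, not stated in the abstract
) -> Dict[str, List[int]]:
    """Give `high_bits` to the globally most salient output features."""
    # Flatten (score, layer, channel) so that all layers compete in one ranking,
    # instead of picking a fixed share inside each layer.
    ranked: List[Tuple[float, str, int]] = sorted(
        (
            (score, layer, idx)
            for layer, scores in salience.items()
            for idx, score in enumerate(scores)
        ),
        key=lambda item: item[0],
        reverse=True,
    )
    budget = int(len(ranked) * high_fraction)

    bits = {layer: [low_bits] * len(scores) for layer, scores in salience.items()}
    for _, layer, idx in ranked[:budget]:
        bits[layer][idx] = high_bits
    return bits


def average_bits(bits: Dict[str, List[int]]) -> float:
    """Mean bit-width, assuming every output feature holds the same number of weights."""
    widths = [b for layer_bits in bits.values() for b in layer_bits]
    return sum(widths) / len(widths)


# Hypothetical salience scores for two layers with five output features each.
salience = {
    "layer0.proj": [0.1, 0.9, 0.2, 0.3, 0.1],
    "layer1.proj": [0.95, 0.92, 0.05, 0.2, 0.6],
}
bits = assign_bit_widths(salience, high_fraction=0.2)
print(bits)                # both high-precision slots go to layer1.proj
print(average_bits(bits))  # 4.8
```

In this toy input both high-precision slots land in the second layer, which a per-layer ranking with the same budget would not allow. Under the assumed 4-bit and 8-bit widths, promoting 10% of equally sized output features would raise the average from 4 to 4.4 bits, which is one way to read "10% more bits".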
