
"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization

November 4, 2024
Authors: Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh
cs.AI

Abstract

Despite the popularity of large language model (LLM) quantization for inference acceleration, significant uncertainty remains regarding the accuracy-performance trade-offs associated with various quantization formats. We present a comprehensive empirical study of quantized accuracy, evaluating popular quantization formats (FP8, INT8, INT4) across academic benchmarks and real-world tasks, on the entire Llama-3.1 model family. Additionally, our study examines the difference in text generated by quantized models versus their uncompressed counterparts. Beyond benchmarks, we also present a couple of quantization improvements which allowed us to obtain state-of-the-art accuracy recovery results. Our investigation, encompassing over 500,000 individual evaluations, yields several key findings: (1) FP8 weight and activation quantization (W8A8-FP) is lossless across all model scales, (2) INT8 weight and activation quantization (W8A8-INT), when properly tuned, incurs surprisingly low 1-3% accuracy degradation, and (3) INT4 weight-only quantization (W4A16-INT) is competitive with 8-bit integer weight and activation quantization. To address the question of the "best" format for a given deployment environment, we conduct inference performance analysis using the popular open-source vLLM framework on various GPU architectures. We find that W4A16 offers the best cost-efficiency for synchronous deployments, and for asynchronous deployment on mid-tier GPUs. At the same time, W8A8 formats excel in asynchronous "continuous batching" deployment of mid- and large-size models on high-end GPUs. Our results provide a set of practical guidelines for deploying quantized LLMs across scales and performance requirements.
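To make the formats above concrete, the following minimal sketch shows the core arithmetic of symmetric round-to-nearest INT8 weight quantization with one scale per output channel, i.e. the weight half of a W8A8-INT scheme. This is an illustrative toy example, not the recipe used in the paper: the function names, the random weight matrix, and the round-trip error measurement are assumptions made here for exposition, and the paper's reported W8A8-INT accuracy additionally depends on calibration and tuning steps not shown.

    import numpy as np

    def quantize_int8_per_channel(w: np.ndarray):
        """Symmetric round-to-nearest INT8 quantization, one scale per output channel (row)."""
        # Map each row's maximum magnitude to 127 to get a per-channel scale.
        scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
        scales = np.where(scales == 0, 1.0, scales)  # guard against all-zero rows
        q = np.clip(np.round(w / scales), -128, 127).astype(np.int8)
        return q, scales

    def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
        """Reconstruct an approximate FP32 weight matrix from INT8 values and scales."""
        return q.astype(np.float32) * scales

    # Toy example: measure the round-trip error introduced by quantization.
    rng = np.random.default_rng(0)
    w = rng.normal(size=(4096, 4096)).astype(np.float32)
    q, s = quantize_int8_per_channel(w)
    err = np.abs(w - dequantize(q, s)).mean()
    print(f"mean absolute quantization error: {err:.6f}")

W4A16-INT follows the same idea with a 4-bit grid (typically with one scale per small group of weights within a channel), while W8A8-FP replaces the uniform integer grid with the FP8 value set.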
