"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization
November 4, 2024
Authors: Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh
cs.AI
Abstract
Despite the popularity of large language model (LLM) quantization for
inference acceleration, significant uncertainty remains regarding the
accuracy-performance trade-offs associated with various quantization formats.
We present a comprehensive empirical study of quantized accuracy, evaluating
popular quantization formats (FP8, INT8, INT4) across academic benchmarks and
real-world tasks, on the entire Llama-3.1 model family. Additionally, our study
examines the difference in text generated by quantized models versus their
uncompressed counterparts. Beyond benchmarks, we also present a couple of
quantization improvements which allowed us to obtain state-of-the-art accuracy
recovery results. Our investigation, encompassing over 500,000 individual
evaluations, yields several key findings: (1) FP8 weight and activation
quantization (W8A8-FP) is lossless across all model scales, (2) INT8 weight and
activation quantization (W8A8-INT), when properly tuned, incurs surprisingly
low 1-3% accuracy degradation, and (3) INT4 weight-only quantization
(W4A16-INT) is competitive with 8-bit integer weight and activation
quantization. To address the question of the "best" format for a given
deployment environment, we conduct inference performance analysis using the
popular open-source vLLM framework on various GPU architectures. We find that
W4A16 offers the best cost-efficiency for synchronous deployments, and for
asynchronous deployment on mid-tier GPUs. At the same time, W8A8 formats excel
in asynchronous "continuous batching" deployment of mid- and large-size models
on high-end GPUs. Our results provide a set of practical guidelines for
deploying quantized LLMs across scales and performance requirements.
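For readers unfamiliar with the WxAy notation: W4A16 means 4-bit weights with 16-bit activations, while W8A8 quantizes both weights and activations to 8 bits. The sketch below is not the paper's code; it only illustrates the simplest building block, per-channel symmetric round-to-nearest weight quantization in NumPy, on top of which the calibrated schemes evaluated in the study improve.

```python
# Minimal sketch (illustrative only): per-channel symmetric round-to-nearest
# weight quantization, showing what the "W8" / "W4" part of W8A8 / W4A16 stores.
import numpy as np

def quantize_weights(w: np.ndarray, num_bits: int = 8):
    """Quantize a 2-D weight matrix to signed integers, one scale per output row."""
    qmax = 2 ** (num_bits - 1) - 1                       # 127 for INT8, 7 for INT4
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax  # per-channel scale
    scale = np.where(scale == 0, 1.0, scale)             # guard against zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Map integer codes back to floating point for the matmul."""
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((4, 8)).astype(np.float32)
    for bits in (8, 4):
        q, s = quantize_weights(w, bits)
        err = np.abs(w - dequantize(q, s)).mean()
        print(f"INT{bits} mean absolute reconstruction error: {err:.4f}")
```

In a real deployment the INT4 codes would be packed two per byte, and production quantization pipelines typically add calibration data and weight-update corrections to reach the accuracy-recovery levels reported in the abstract; this sketch omits all of that.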