70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float

April 15, 2025
Authors: Tianyi Zhang, Yang Sui, Shaochen Zhong, Vipin Chaudhary, Xia Hu, Anshumali Shrivastava
cs.AI

Abstract

Large Language Models (LLMs) have grown rapidly in size, creating significant challenges for efficient deployment on resource-constrained hardware. In this paper, we introduce Dynamic-Length Float (DFloat11), a lossless compression framework that reduces LLM size by 30% while preserving outputs that are bit-for-bit identical to the original model. DFloat11 is motivated by the low entropy of the BFloat16 weight representation in LLMs, which reveals significant inefficiency in the existing storage format. By applying entropy coding, DFloat11 assigns dynamic-length encodings to weights based on frequency, achieving near information-optimal compression without any loss of precision. To facilitate efficient inference with dynamic-length encodings, we develop a custom GPU kernel for fast online decompression. Our design incorporates the following: (i) decomposition of memory-intensive lookup tables (LUTs) into compact LUTs that fit in GPU SRAM, (ii) a two-phase kernel that coordinates thread read/write positions using lightweight auxiliary variables, and (iii) transformer-block-level decompression to minimize latency. Experiments on recent models, including Llama-3.1, Qwen-2.5, and Gemma-3, validate our hypothesis that DFloat11 achieves around 30% model size reduction while preserving bit-for-bit exact outputs. Compared to a potential alternative of offloading parts of an uncompressed model to the CPU to meet memory constraints, DFloat11 achieves 1.9-38.8x higher throughput in token generation. With a fixed GPU memory budget, DFloat11 enables 5.3-13.17x longer context lengths than uncompressed models. Notably, our method enables lossless inference of Llama-3.1-405B, an 810GB model, on a single node equipped with 8x80GB GPUs. Our code and models are available at https://github.com/LeanModels/DFloat11.
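
As a rough illustration of why a lossless ~30% reduction is plausible, the sketch below (a toy estimate, not the authors' GPU kernel or file format) Huffman-codes only the 8-bit exponent field of a synthetic, roughly Gaussian BFloat16 weight tensor while storing the sign and mantissa bits verbatim; both the synthetic weights and the exponent-only coding split are illustrative assumptions. On such data the effective rate lands near 11 bits per weight, consistent with the roughly 70%-of-original size that DFloat11 reports.

```python
# Toy estimate of entropy-coding BFloat16 weights (illustrative only).
# Assumption: savings come mainly from the highly skewed 8-bit exponent field.
import heapq
from collections import Counter

import numpy as np

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over `freqs`."""
    heap = [(f, i, [s]) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in freqs}
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:          # every merge adds one bit to these codes
            lengths[s] += 1
        heapq.heappush(heap, (f1 + f2, tiebreak, syms1 + syms2))
        tiebreak += 1
    return lengths

# Synthetic stand-in for an LLM weight tensor (roughly Gaussian values).
weights = np.random.normal(0.0, 0.02, size=1_000_000).astype(np.float32)
bf16_bits = (weights.view(np.uint32) >> 16).astype(np.uint16)  # truncate to BFloat16
exponents = ((bf16_bits >> 7) & 0xFF).tolist()                  # 8-bit exponent field

freqs = Counter(exponents)
lengths = huffman_code_lengths(freqs)
n = len(exponents)
avg_exp_bits = sum(freqs[s] * lengths[s] for s in freqs) / n

# 1 sign bit + 7 mantissa bits stored as-is; only the exponent is entropy coded.
bits_per_weight = 1 + 7 + avg_exp_bits
print(f"average exponent code length: {avg_exp_bits:.2f} bits")
print(f"effective bits per weight:    {bits_per_weight:.2f} (vs. 16 for BFloat16)")
```

Because each weight ends up with a variable-length code, sequential decoding would be slow on a GPU; this is why the paper pairs the format with compact SRAM-resident lookup tables and a two-phase kernel that first resolves per-thread read/write offsets before decoding in parallel.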
