70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
April 15, 2025
Authors: Tianyi Zhang, Yang Sui, Shaochen Zhong, Vipin Chaudhary, Xia Hu, Anshumali Shrivastava
cs.AI
Abstract
Large Language Models (LLMs) have grown rapidly in size, creating significant
challenges for efficient deployment on resource-constrained hardware. In this
paper, we introduce Dynamic-Length Float (DFloat11), a lossless compression
framework that reduces LLM size by 30% while preserving outputs that are
bit-for-bit identical to the original model. DFloat11 is motivated by the low
entropy in the BFloat16 weight representation of LLMs, which reveals
significant inefficiency in existing storage formats. By applying entropy
coding, DFloat11 assigns dynamic-length encodings to weights based on
frequency, achieving near information-optimal compression without any loss of
precision. To facilitate efficient inference with dynamic-length encodings, we
develop a custom GPU kernel for fast online decompression. Our design
incorporates the following: (i) decomposition of memory-intensive lookup tables
(LUTs) into compact LUTs that fit in GPU SRAM, (ii) a two-phase kernel for
coordinating thread read/write positions using lightweight auxiliary variables,
and (iii) transformer-block-level decompression to minimize latency.
Experiments on recent models, including Llama-3.1, Qwen-2.5, and Gemma-3,
validate our hypothesis that DFloat11 achieves around 30% model size reduction
while preserving bit-for-bit exact outputs. Compared to a potential alternative
of offloading parts of an uncompressed model to the CPU to meet memory
constraints, DFloat11 achieves 1.9-38.8x higher throughput in token generation.
With a fixed GPU memory budget, DFloat11 enables 5.3-13.17x longer context
lengths than uncompressed models. Notably, our method enables lossless
inference of Llama-3.1-405B, an 810GB model, on a single node equipped with
8x80GB GPUs. Our code and models are available at
https://github.com/LeanModels/DFloat11.
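To make the entropy-coding idea concrete, here is a minimal Python sketch (not the authors' implementation, which relies on custom GPU kernels for online decompression). It Huffman-codes the 8-bit exponent field of simulated BFloat16 weights; the synthetic Gaussian weight distribution and the function name `huffman_code_lengths` are illustrative assumptions, not taken from the paper's code. Because exponents of trained-LLM weights concentrate in a narrow range, the average code length falls well below 8 bits, which is the source of the roughly 30% size reduction.

```python
# Minimal sketch of the entropy-coding idea (NOT the paper's GPU kernel):
# Huffman-code the 8-bit exponent field of BFloat16 weights, whose low
# entropy is what DFloat11 exploits. The weight distribution below is a
# synthetic stand-in for trained LLM weights.
import heapq
from collections import Counter

import numpy as np

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    # Heap items: (subtree weight, tiebreaker, symbols in this subtree).
    heap = [(w, i, [s]) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1  # each merge adds one bit to every member's code
        heapq.heappush(heap, (w1 + w2, tiebreak, syms1 + syms2))
        tiebreak += 1
    return lengths

# BFloat16 layout: 1 sign bit | 8 exponent bits | 7 mantissa bits.
rng = np.random.default_rng(0)
weights = (rng.standard_normal(1_000_000) * 0.02).astype(np.float32)
bf16 = (weights.view(np.uint32) >> 16).astype(np.uint16)  # truncate to bfloat16
exponents = ((bf16 >> 7) & 0xFF).astype(np.uint8)

freqs = Counter(exponents.tolist())
lengths = huffman_code_lengths(freqs)
avg_exp_bits = sum(freqs[s] * lengths[s] for s in freqs) / exponents.size

# Sign and mantissa stay uncompressed (8 bits total); only the exponent shrinks.
bits_per_weight = 8 + avg_exp_bits
print(f"average exponent code length: {avg_exp_bits:.2f} bits (down from 8)")
print(f"effective width: {bits_per_weight:.2f}/16 bits "
      f"-> {bits_per_weight / 16:.0%} of BFloat16 size, losslessly")
```

The sketch compresses only the exponent field, since the sign and mantissa bits of trained weights are close to uniform and therefore nearly incompressible. What it deliberately omits is the hard part the paper addresses: decoding such variable-length streams efficiently on GPUs via compact SRAM-resident lookup tables, a two-phase kernel, and transformer-block-level decompression.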