70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
April 15, 2025
Authors: Tianyi Zhang, Yang Sui, Shaochen Zhong, Vipin Chaudhary, Xia Hu, Anshumali Shrivastava
cs.AI
Abstract
Large Language Models (LLMs) have grown rapidly in size, creating significant
challenges for efficient deployment on resource-constrained hardware. In this
paper, we introduce Dynamic-Length Float (DFloat11), a lossless compression
framework that reduces LLM size by 30% while preserving outputs that are
bit-for-bit identical to the original model. DFloat11 is motivated by the low
entropy in the BFloat16 weight representation of LLMs, which reveals
significant inefficiency in existing storage formats. By applying entropy
coding, DFloat11 assigns dynamic-length encodings to weights based on
frequency, achieving near information-optimal compression without any loss of
precision. To facilitate efficient inference with dynamic-length encodings, we
develop a custom GPU kernel for fast online decompression. Our design
incorporates the following: (i) decomposition of memory-intensive lookup tables
(LUTs) into compact LUTs that fit in GPU SRAM, (ii) a two-phase kernel for
coordinating thread read/write positions using lightweight auxiliary variables,
and (iii) transformer-block-level decompression to minimize latency.
Experiments on recent models, including Llama-3.1, Qwen-2.5, and Gemma-3,
validate our hypothesis that DFloat11 achieves around 30% model size reduction
while preserving bit-for-bit exact outputs. Compared to a potential alternative
of offloading parts of an uncompressed model to the CPU to meet memory
constraints, DFloat11 achieves 1.9-38.8x higher throughput in token generation.
With a fixed GPU memory budget, DFloat11 enables 5.3-13.17x longer context
lengths than uncompressed models. Notably, our method enables lossless
inference of Llama-3.1-405B, an 810GB model, on a single node equipped with
8x80GB GPUs. Our code and models are available at
https://github.com/LeanModels/DFloat11.
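To make the entropy-coding idea concrete, here is a minimal Python sketch (not the authors' implementation, which relies on custom GPU kernels for online decompression). It Huffman-codes the 8-bit exponent field of simulated BFloat16 weights; the synthetic Gaussian weight distribution and the function name `huffman_code_lengths` are illustrative assumptions, not taken from the paper's code. Because exponents of trained-LLM weights concentrate in a narrow range, the average code length falls well below 8 bits, which is the source of the roughly 30% size reduction.

```python
# Minimal sketch of the entropy-coding idea (NOT the paper's GPU kernel):
# Huffman-code the 8-bit exponent field of BFloat16 weights, whose low
# entropy is what DFloat11 exploits. The weight distribution below is a
# synthetic stand-in for trained LLM weights.
import heapq
from collections import Counter

import numpy as np

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    # Heap items: (subtree weight, tiebreaker, symbols in this subtree).
    heap = [(w, i, [s]) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1  # each merge adds one bit to every member's code
        heapq.heappush(heap, (w1 + w2, tiebreak, syms1 + syms2))
        tiebreak += 1
    return lengths

# BFloat16 layout: 1 sign bit | 8 exponent bits | 7 mantissa bits.
rng = np.random.default_rng(0)
weights = (rng.standard_normal(1_000_000) * 0.02).astype(np.float32)
bf16 = (weights.view(np.uint32) >> 16).astype(np.uint16)  # truncate to bfloat16
exponents = ((bf16 >> 7) & 0xFF).astype(np.uint8)

freqs = Counter(exponents.tolist())
lengths = huffman_code_lengths(freqs)
avg_exp_bits = sum(freqs[s] * lengths[s] for s in freqs) / exponents.size

# Sign and mantissa stay uncompressed (8 bits total); only the exponent shrinks.
bits_per_weight = 8 + avg_exp_bits
print(f"average exponent code length: {avg_exp_bits:.2f} bits (down from 8)")
print(f"effective width: {bits_per_weight:.2f}/16 bits "
      f"-> {bits_per_weight / 16:.0%} of BFloat16 size, losslessly")
```

The sketch compresses only the exponent field, since the sign and mantissa bits of trained weights are close to uniform and therefore nearly incompressible. What it deliberately omits is the hard part the paper addresses: decoding such variable-length streams efficiently on GPUs via compact SRAM-resident lookup tables, a two-phase kernel, and transformer-block-level decompression.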