BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments
October 31, 2024
Authors: Xinghao Wang, Pengyu Wang, Bo Wang, Dong Zhang, Yunhua Zhou, Xipeng Qiu
cs.AI
Abstract
Large language models (LLMs) have revolutionized numerous applications, yet
their deployment remains challenged by memory constraints on local devices.
While scaling laws have enhanced LLM capabilities, the primary bottleneck has
shifted from capability to availability, emphasizing the need
for efficient memory management. Traditional compression methods, such as
quantization, often require predefined compression ratios and separate
compression processes for each setting, complicating deployment in variable
memory environments. In this paper, we introduce BitStack, a novel,
training-free weight compression approach that enables megabyte-level
trade-offs between memory usage and model performance. By leveraging weight
decomposition, BitStack can dynamically adjust the model size with minimal
transmission between running memory and storage devices. Our approach
iteratively decomposes weight matrices while considering the significance of
each parameter, resulting in an approximately 1-bit per parameter residual
block in each decomposition iteration. These blocks are sorted and stacked in
storage as basic transmission units, with different quantities loaded based on
current memory availability. Extensive experiments across a wide range of tasks
demonstrate that, despite offering fine-grained size control, BitStack
consistently matches or surpasses strong quantization baselines, particularly
at extreme compression ratios. To the best of our knowledge, this is the first
decomposition-based method that effectively bridges the gap to practical
compression techniques like quantization. Code is available at
https://github.com/xinghaow99/BitStack.
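To make the mechanism described above concrete, here is a minimal, hedged sketch in PyTorch of one plausible reading of the iterative decomposition: each iteration emits a 1-bit sign matrix plus a small low-rank approximation of the residual's magnitude, and reconstruction loads as many stacked blocks as the current memory budget allows. This is not the authors' implementation (see the linked repository for that); the function and parameter names (`decompose`, `reconstruct`, `num_iters`, `rank`) are illustrative, and the per-parameter significance weighting mentioned in the abstract is omitted.

```python
# Illustrative sketch only: sign-plus-low-rank iterative residual
# decomposition, loosely following the abstract's description.
import torch

def decompose(W: torch.Tensor, num_iters: int = 8, rank: int = 16):
    """Split W into a stack of residual blocks, each dominated by a
    1-bit-per-parameter sign matrix plus rank-`rank` factors."""
    blocks = []
    residual = W.clone()
    for _ in range(num_iters):
        sign = torch.sign(residual)                    # ~1 bit per parameter
        # Low-rank approximation of the residual's magnitude.
        U, S, Vh = torch.linalg.svd(residual.abs(), full_matrices=False)
        U_k = U[:, :rank] * S[:rank]                   # fold singular values into U
        V_k = Vh[:rank, :]
        blocks.append((sign, U_k, V_k))                # one basic transmission unit
        residual = residual - sign * (U_k @ V_k)       # pass the residual onward
    return blocks

def reconstruct(blocks, num_loaded: int) -> torch.Tensor:
    """Load only the first `num_loaded` blocks: fewer blocks means a
    smaller memory footprint but a coarser weight approximation."""
    W_hat = torch.zeros_like(blocks[0][0])
    for sign, U_k, V_k in blocks[:num_loaded]:
        W_hat += sign * (U_k @ V_k)
    return W_hat

# Example: trade reconstruction fidelity for memory by varying num_loaded.
W = torch.randn(512, 512)
blocks = decompose(W)
W_coarse = reconstruct(blocks, num_loaded=2)  # smaller footprint, higher error
W_fine = reconstruct(blocks, num_loaded=8)    # larger footprint, lower error
```

Under this reading, the "approximately 1-bit per parameter" figure follows from the block sizes: for an m-by-n weight matrix, each block stores mn sign bits plus rank * (m + n) factor entries, and the factor term is negligible when rank is much smaller than min(m, n). Loading additional blocks only refines the running sum, which is what permits megabyte-level control over the memory/performance trade-off.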