BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments
October 31, 2024
Authors: Xinghao Wang, Pengyu Wang, Bo Wang, Dong Zhang, Yunhua Zhou, Xipeng Qiu
cs.AI
Abstract
Large language models (LLMs) have revolutionized numerous applications, yet
their deployment remains challenged by memory constraints on local devices.
While scaling laws have enhanced LLM capabilities, the primary bottleneck has
shifted from capability to availability, emphasizing the need
for efficient memory management. Traditional compression methods, such as
quantization, often require predefined compression ratios and separate
compression processes for each setting, complicating deployment in variable
memory environments. In this paper, we introduce BitStack, a novel,
training-free weight compression approach that enables megabyte-level
trade-offs between memory usage and model performance. By leveraging weight
decomposition, BitStack can dynamically adjust the model size with minimal
transmission between running memory and storage devices. Our approach
iteratively decomposes weight matrices while considering the significance of
each parameter, resulting in an approximately 1-bit per parameter residual
block in each decomposition iteration. These blocks are sorted and stacked in
storage as basic transmission units, with different quantities loaded based on
current memory availability. Extensive experiments across a wide range of tasks
demonstrate that, despite offering fine-grained size control, BitStack
consistently matches or surpasses strong quantization baselines, particularly
at extreme compression ratios. To the best of our knowledge, this is the first
decomposition-based method that effectively bridges the gap to practical
compression techniques like quantization. Code is available at
https://github.com/xinghaow99/BitStack.
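To make the mechanism described above concrete, here is a minimal, hedged sketch in PyTorch of one plausible reading of the iterative decomposition: each iteration emits a 1-bit sign matrix plus a small low-rank approximation of the residual's magnitude, and reconstruction loads as many stacked blocks as the current memory budget allows. This is not the authors' implementation (see the linked repository for that); the function and parameter names (`decompose`, `reconstruct`, `num_iters`, `rank`) are illustrative, and the per-parameter significance weighting mentioned in the abstract is omitted.

```python
# Illustrative sketch only: sign-plus-low-rank iterative residual
# decomposition, loosely following the abstract's description.
import torch

def decompose(W: torch.Tensor, num_iters: int = 8, rank: int = 16):
    """Split W into a stack of residual blocks, each dominated by a
    1-bit-per-parameter sign matrix plus rank-`rank` factors."""
    blocks = []
    residual = W.clone()
    for _ in range(num_iters):
        sign = torch.sign(residual)                    # ~1 bit per parameter
        # Low-rank approximation of the residual's magnitude.
        U, S, Vh = torch.linalg.svd(residual.abs(), full_matrices=False)
        U_k = U[:, :rank] * S[:rank]                   # fold singular values into U
        V_k = Vh[:rank, :]
        blocks.append((sign, U_k, V_k))                # one basic transmission unit
        residual = residual - sign * (U_k @ V_k)       # pass the residual onward
    return blocks

def reconstruct(blocks, num_loaded: int) -> torch.Tensor:
    """Load only the first `num_loaded` blocks: fewer blocks means a
    smaller memory footprint but a coarser weight approximation."""
    W_hat = torch.zeros_like(blocks[0][0])
    for sign, U_k, V_k in blocks[:num_loaded]:
        W_hat += sign * (U_k @ V_k)
    return W_hat

# Example: trade reconstruction fidelity for memory by varying num_loaded.
W = torch.randn(512, 512)
blocks = decompose(W)
W_coarse = reconstruct(blocks, num_loaded=2)  # smaller footprint, higher error
W_fine = reconstruct(blocks, num_loaded=8)    # larger footprint, lower error
```

Under this reading, the "approximately 1-bit per parameter" figure follows from the block sizes: for an m-by-n weight matrix, each block stores mn sign bits plus rank * (m + n) factor entries, and the factor term is negligible when rank is much smaller than min(m, n). Loading additional blocks only refines the running sum, which is what permits megabyte-level control over the memory/performance trade-off.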