

BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments

October 31, 2024
作者: Xinghao Wang, Pengyu Wang, Bo Wang, Dong Zhang, Yunhua Zhou, Xipeng Qiu
cs.AI

Abstract
Large language models (LLMs) have revolutionized numerous applications, yet their deployment remains challenged by memory constraints on local devices. While scaling laws have enhanced LLM capabilities, the primary bottleneck has shifted from capability to availability, emphasizing the need for efficient memory management. Traditional compression methods, such as quantization, often require predefined compression ratios and separate compression processes for each setting, complicating deployment in variable memory environments. In this paper, we introduce BitStack, a novel, training-free weight compression approach that enables megabyte-level trade-offs between memory usage and model performance. By leveraging weight decomposition, BitStack can dynamically adjust the model size with minimal transmission between running memory and storage devices. Our approach iteratively decomposes weight matrices while considering the significance of each parameter, resulting in an approximately 1-bit per parameter residual block in each decomposition iteration. These blocks are sorted and stacked in storage as basic transmission units, with different quantities loaded based on current memory availability. Extensive experiments across a wide range of tasks demonstrate that, despite offering fine-grained size control, BitStack consistently matches or surpasses strong quantization baselines, particularly at extreme compression ratios. To the best of our knowledge, this is the first decomposition-based method that effectively bridges the gap to practical compression techniques like quantization. Code is available at https://github.com/xinghaow99/BitStack.
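The core idea described in the abstract — iteratively decomposing a weight matrix into small residual blocks that can be stacked in storage and loaded incrementally — can be sketched as follows. This is a simplified illustration only, not the authors' implementation: it uses a plain truncated SVD of the residual at each iteration, and omits BitStack's importance weighting and its ~1-bit-per-parameter sign packing (the function name and parameters are ours).

```python
import numpy as np

def iterative_residual_decomposition(W, num_iters=6, rank=1):
    """Illustrative sketch: repeatedly approximate the remaining residual
    with a tiny low-rank block, so the matrix can be reconstructed
    progressively from however many blocks currently fit in memory."""
    blocks = []
    residual = W.copy()
    for _ in range(num_iters):
        # BitStack additionally scales by parameter importance and stores
        # a sign matrix at ~1 bit/parameter; here a plain truncated SVD
        # of the residual stands in for that step.
        U, S, Vt = np.linalg.svd(residual, full_matrices=False)
        block = (U[:, :rank] * S[:rank]) @ Vt[:rank]
        blocks.append(block)
        residual -= block
    # Each block is a basic "transmission unit": loading more of them
    # yields a strictly better approximation of W.
    return blocks

np.random.seed(0)
W = np.random.randn(8, 8)
blocks = iterative_residual_decomposition(W)

# Reconstruction error shrinks as more blocks are loaded from storage.
approx = np.zeros_like(W)
errs = []
for b in blocks:
    approx += b
    errs.append(np.linalg.norm(W - approx))
```

Because each iteration removes the best rank-1 approximation of the current residual, the reconstruction error is monotonically non-increasing in the number of loaded blocks — which is what enables the megabyte-level memory/performance trade-off the paper describes.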
