
EvoPress: Towards Optimal Dynamic Model Compression via Evolutionary Search

October 18, 2024
作者: Oliver Sieberling, Denis Kuznedelev, Eldar Kurtic, Dan Alistarh
cs.AI

Abstract

The high computational costs of large language models (LLMs) have led to a flurry of research on LLM compression, via methods such as quantization, sparsification, or structured pruning. A new frontier in this area is given by dynamic, non-uniform compression methods, which adjust the compression levels (e.g., sparsity) per-block or even per-layer in order to minimize accuracy loss, while guaranteeing a global compression threshold. Yet, current methods rely on heuristics for identifying the "importance" of a given layer towards the loss, based on assumptions such as error monotonicity, i.e. that the end-to-end model compression error is proportional to the sum of layer-wise errors. In this paper, we revisit this area, and propose a new and general approach for dynamic compression that is provably optimal in a given input range. We begin from the motivating observation that, in general, error monotonicity does not hold for LLMs: compressed models with lower sum of per-layer errors can perform worse than models with higher error sums. To address this, we propose a new general evolutionary framework for dynamic LLM compression called EvoPress, which has provable convergence, and low sample and evaluation complexity. We show that these theoretical guarantees lead to highly competitive practical performance for dynamic compression of Llama, Mistral and Phi models. Via EvoPress, we set new state-of-the-art results across all compression approaches: structural pruning (block/layer dropping), unstructured sparsity, as well as quantization with dynamic bitwidths. Our code is available at https://github.com/IST-DASLab/EvoPress.
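To make the idea of dynamic, non-uniform compression concrete, the sketch below runs a minimal evolutionary search that assigns a per-layer sparsity level under a global average-sparsity budget. This is an illustrative toy, not the EvoPress implementation: the layer count, candidate levels, and the synthetic `fitness` function (a stand-in for evaluating the compressed model's end-to-end loss on calibration data) are all assumptions for demonstration.

```python
import random

NUM_LAYERS = 8
LEVELS = [0.0, 0.25, 0.5, 0.75]  # hypothetical per-layer sparsity options
BUDGET = 0.5                     # required average sparsity across layers

# Synthetic per-layer sensitivity: here, earlier layers are assumed to be
# more sensitive to compression. A real search would measure model loss.
SENSITIVITY = [1.0 / (i + 1) for i in range(NUM_LAYERS)]

def fitness(candidate):
    """Stand-in for end-to-end compression error: lower is better."""
    return sum(s * lvl ** 2 for s, lvl in zip(SENSITIVITY, candidate))

def random_candidate(rng):
    """Sample level assignments until the global budget is met."""
    while True:
        cand = [rng.choice(LEVELS) for _ in range(NUM_LAYERS)]
        if sum(cand) / NUM_LAYERS >= BUDGET:
            return cand

def mutate(candidate, rng):
    """Swap levels between two layers: preserves the global average."""
    cand = list(candidate)
    i, j = rng.sample(range(NUM_LAYERS), 2)
    cand[i], cand[j] = cand[j], cand[i]
    return cand

def evolve(generations=200, offspring=4, seed=0):
    """Simple elitist (1+k) evolutionary search over level assignments."""
    rng = random.Random(seed)
    best = random_candidate(rng)
    for _ in range(generations):
        children = [mutate(best, rng) for _ in range(offspring)]
        challenger = min(children, key=fitness)
        if fitness(challenger) < fitness(best):
            best = challenger
    return best

if __name__ == "__main__":
    best = evolve()
    print("per-layer levels:", best)
    print("average level:", sum(best) / NUM_LAYERS)
```

Because the mutation only swaps levels between layers, every candidate keeps the same global compression ratio; the search then redistributes compression away from sensitive layers, which is the core intuition behind non-uniform allocation. The actual EvoPress framework additionally provides convergence guarantees and low sample/evaluation complexity, which this toy does not model.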


PDF · November 16, 2024