EvoPress: Towards Optimal Dynamic Model Compression via Evolutionary Search

Abstract

The high computational costs of large language models (LLMs) have led to a flurry of research on LLM compression, via methods such as quantization, sparsification, or structured pruning. A new frontier in this area is given by dynamic, non-uniform compression methods, which adjust the compression levels (e.g., sparsity) per block or even per layer in order to minimize accuracy loss while guaranteeing a global compression threshold. Yet current methods rely on heuristics for identifying the "importance" of a given layer towards the loss, based on assumptions such as error monotonicity, i.e., that the end-to-end model compression error is proportional to the sum of layer-wise errors. In this paper, we revisit this area and propose a new and general approach for dynamic compression that is provably optimal in a given input range. We begin from the motivating observation that, in general, error monotonicity does not hold for LLMs: compressed models with a lower sum of per-layer errors can perform worse than models with higher error sums. To address this, we propose a new general evolutionary framework for dynamic LLM compression called EvoPress, which has provable convergence and low sample and evaluation complexity. We show that these theoretical guarantees lead to highly competitive practical performance for dynamic compression of Llama, Mistral, and Phi models. Via EvoPress, we set new state-of-the-art results across all compression approaches: structural pruning (block/layer dropping), unstructured sparsity, and quantization with dynamic bitwidths. Our code is available at https://github.com/IST-DASLab/EvoPress.
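The abstract does not spell out the search procedure, so the following is only a minimal sketch of what an evolutionary search over per-layer compression levels under a fixed global budget could look like. It is not the EvoPress implementation. The names `SENSITIVITY`, `toy_loss`, `mutate`, and `evolve` are hypothetical, the mutation and selection rules are a generic elitist scheme chosen for illustration, and the loss is a made-up function with a cross-layer interaction term so that it is not a plain sum of per-layer errors. In a real setting the fitness would presumably be measured on the compressed model itself.

```python
import random

# Hypothetical per-layer sensitivities (illustrative values, not measured data).
SENSITIVITY = [5.0, 1.0, 0.5, 2.0, 0.2, 3.0]


def toy_loss(levels):
    """Toy stand-in for end-to-end error; lower is better.

    The interaction term couples adjacent layers, so the total is
    deliberately NOT just a sum of independent per-layer errors.
    """
    additive = sum(s * v * v for s, v in zip(SENSITIVITY, levels))
    interaction = sum(levels[i] * levels[i + 1] for i in range(len(levels) - 1))
    return additive + interaction


def mutate(levels, num_levels, rng):
    """Raise one layer's level and lower another's by one step.

    The sum of levels (the assumed global compression budget) is unchanged.
    """
    child = list(levels)
    up = [i for i, v in enumerate(child) if v < num_levels - 1]
    if not up:
        return child  # every layer is already at the maximum level
    i = rng.choice(up)
    down = [j for j, v in enumerate(child) if v > 0 and j != i]
    if not down:
        return child  # no other layer can give up a level
    j = rng.choice(down)
    child[i] += 1
    child[j] -= 1
    return child


def evolve(fitness, init, num_levels, generations=200, offspring=8, seed=0):
    """Elitist search: keep the parent unless the best child is at least as good."""
    rng = random.Random(seed)
    parent, parent_fit = list(init), fitness(init)
    for _ in range(generations):
        children = [mutate(parent, num_levels, rng) for _ in range(offspring)]
        best = min(children, key=fitness)
        best_fit = fitness(best)
        if best_fit <= parent_fit:
            parent, parent_fit = best, best_fit
    return parent, parent_fit


# Start from a uniform assignment: level 2 (out of levels 0..4) for each of 6 layers.
initial = [2] * len(SENSITIVITY)
best_levels, best_loss = evolve(toy_loss, initial, num_levels=5)
print(best_levels, best_loss)
```

Because every mutation moves one level from one layer to another, each candidate satisfies the same global budget as the starting point, so the search only ever compares configurations with equal total compression.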
