

DarwinLM: Evolutionary Structured Pruning of Large Language Models

February 11, 2025
作者: Shengkun Tang, Oliver Sieberling, Eldar Kurtic, Zhiqiang Shen, Dan Alistarh
cs.AI

Abstract

Large Language Models (LLMs) have achieved significant success across various NLP tasks. However, their massive computational costs limit their widespread use, particularly in real-time applications. Structured pruning offers an effective solution by compressing models and directly providing end-to-end speed improvements, regardless of the hardware environment. Meanwhile, different components of the model exhibit varying sensitivities towards pruning, calling for non-uniform model compression. However, a pruning method should not only identify a capable substructure, but also account for post-compression training. To this end, we propose DarwinLM, a method for training-aware structured pruning. DarwinLM builds upon an evolutionary search process, generating multiple offspring models in each generation through mutation, and selecting the fittest for survival. To assess the effect of post-training, we incorporate a lightweight, multistep training process within the offspring population, progressively increasing the number of tokens and eliminating poorly performing models in each selection stage. We validate our method through extensive experiments on Llama-2-7B, Llama-3.1-8B and Qwen-2.5-14B-Instruct, achieving state-of-the-art performance for structured pruning. For instance, DarwinLM surpasses ShearedLlama while requiring 5x less training data during post-compression training.
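The search procedure the abstract describes (mutate a parent into offspring, then eliminate weak candidates over selection stages with progressively more training tokens) can be illustrated with a minimal sketch. This is a hypothetical toy, not the authors' implementation: here a candidate is a per-layer sparsity-level assignment, `mutate` shifts sparsity between layers under a fixed overall budget, and `fitness` stands in for evaluating a candidate after a given amount of lightweight training.

```python
import random

def mutate(parent, num_levels, rng):
    """Move one sparsity level from one layer to another, keeping the
    total sparsity budget fixed (a common mutation scheme; hypothetical here)."""
    child = list(parent)
    i, j = rng.sample(range(len(child)), 2)
    if child[i] < num_levels - 1 and child[j] > 0:
        child[i] += 1
        child[j] -= 1
    return child

def evolve(fitness, num_layers=8, num_levels=4, generations=10,
           offspring=16, train_steps=(0, 100, 400), survivors=(8, 4, 1),
           seed=0):
    """Evolutionary search sketch.

    fitness(candidate, tokens) -> score (higher is better), standing in
    for "briefly train the pruned candidate on `tokens` tokens, then
    evaluate it". Each generation mutates the parent into `offspring`
    candidates, then runs selection stages with progressively larger
    token budgets, discarding poor performers at every stage.
    """
    rng = random.Random(seed)
    parent = [num_levels // 2] * num_layers  # uniform starting sparsity
    for _ in range(generations):
        population = [mutate(parent, num_levels, rng)
                      for _ in range(offspring)]
        for tokens, keep in zip(train_steps, survivors):
            population.sort(key=lambda c: fitness(c, tokens), reverse=True)
            population = population[:keep]  # successive elimination
        parent = population[0]  # sole survivor becomes the next parent
    return parent
```

The progressively increasing `train_steps` budget mirrors the paper's idea that cheap early stages filter out clearly poor candidates before spending more training compute on the promising ones; all parameter names and defaults above are illustrative.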

