Efficient Hybrid Language Model Compression through Group-Aware SSM Pruning
April 15, 2025
作者: Ali Taghibakhshi, Sharath Turuvekere Sreenivas, Saurav Muralidharan, Marcin Chochowski, Yashaswi Karnati, Raviraj Joshi, Ameya Sunil Mahabaleshwarkar, Zijia Chen, Yoshi Suhara, Oluwatobi Olabiyi, Daniel Korzekwa, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Bryan Catanzaro, Ashwath Aithal, Nima Tajbakhsh, Pavlo Molchanov
cs.AI
Abstract
Hybrid LLM architectures that combine Attention and State Space Models (SSMs)
achieve state-of-the-art accuracy and runtime performance. Recent work has
demonstrated that applying compression and distillation to Attention-only
models yields smaller, more accurate models at a fraction of the training cost.
In this work, we explore the effectiveness of compressing Hybrid architectures.
We introduce a novel group-aware pruning strategy that preserves the structural
integrity of SSM blocks and their sequence modeling capabilities. Furthermore,
we demonstrate the necessity of such SSM pruning to achieve improved accuracy
and inference speed compared to traditional approaches. Our compression recipe
combines SSM, FFN, embedding dimension, and layer pruning, followed by
knowledge distillation-based retraining, similar to the MINITRON technique.
Using this approach, we compress the Nemotron-H 8B Hybrid model down to 4B
parameters with up to 40x fewer training tokens. The resulting model surpasses
the accuracy of similarly-sized models while achieving 2x faster inference,
significantly advancing the Pareto frontier.
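The group-aware constraint described in the abstract can be illustrated with a minimal sketch. Assuming a Mamba-2-style SSM layer whose heads are partitioned evenly into B/C groups, importance scores are ranked within each group and the same number of heads is retained per group, so the head-to-group mapping survives pruning. The function and tensor names below are hypothetical illustrations, not the paper's exact implementation.

```python
import torch


def group_aware_head_selection(
    head_importance: torch.Tensor,  # [n_heads] importance score per SSM head
    n_groups: int,                  # number of B/C groups (heads split evenly)
    keep_per_group: int,            # heads to retain in every group
) -> torch.Tensor:
    """Return indices of retained heads while preserving group structure.

    Sketch only: scores are assumed to be precomputed (e.g. activation
    magnitudes on a calibration set). The key point is that top-k selection
    happens *within* each group, so every group keeps the same number of
    heads and the SSM block's structural integrity is maintained.
    """
    n_heads = head_importance.numel()
    heads_per_group = n_heads // n_groups
    kept = []
    for g in range(n_groups):
        start = g * heads_per_group
        group_scores = head_importance[start:start + heads_per_group]
        top = torch.topk(group_scores, keep_per_group).indices + start
        kept.append(top)
    return torch.cat(kept).sort().values


# Example: 32 heads in 8 groups, keeping 2 heads per group (32 -> 16 heads).
scores = torch.rand(32)
retained = group_aware_head_selection(scores, n_groups=8, keep_per_group=2)
print(retained)  # 16 head indices, exactly 2 drawn from each group of 4
```

In a full pipeline, the retained indices would then be used to slice the corresponding SSM projection and state parameters, with FFN, embedding-dimension, and layer pruning applied analogously before the distillation-based retraining stage described above.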