Efficient Hybrid Language Model Compression through Group-Aware SSM Pruning
April 15, 2025
作者: Ali Taghibakhshi, Sharath Turuvekere Sreenivas, Saurav Muralidharan, Marcin Chochowski, Yashaswi Karnati, Raviraj Joshi, Ameya Sunil Mahabaleshwarkar, Zijia Chen, Yoshi Suhara, Oluwatobi Olabiyi, Daniel Korzekwa, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Bryan Catanzaro, Ashwath Aithal, Nima Tajbakhsh, Pavlo Molchanov
cs.AI
Abstract
Hybrid LLM architectures that combine Attention and State Space Models (SSMs)
achieve state-of-the-art accuracy and runtime performance. Recent work has
demonstrated that applying compression and distillation to Attention-only
models yields smaller, more accurate models at a fraction of the training cost.
In this work, we explore the effectiveness of compressing Hybrid architectures.
We introduce a novel group-aware pruning strategy that preserves the structural
integrity of SSM blocks and their sequence modeling capabilities. Furthermore,
we demonstrate the necessity of such SSM pruning to achieve improved accuracy
and inference speed compared to traditional approaches. Our compression recipe
combines SSM, FFN, embedding dimension, and layer pruning, followed by
knowledge distillation-based retraining, similar to the MINITRON technique.
Using this approach, we compress the Nemotron-H 8B Hybrid model down to 4B
parameters with up to 40x fewer training tokens. The resulting model surpasses
the accuracy of similarly-sized models while achieving 2x faster inference,
significantly advancing the Pareto frontier.
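The group-aware constraint described in the abstract can be illustrated with a minimal sketch. Assuming a Mamba-2-style SSM layer whose heads are partitioned evenly into B/C groups, importance scores are ranked within each group and the same number of heads is retained per group, so the head-to-group mapping survives pruning. The function and tensor names below are hypothetical illustrations, not the paper's exact implementation.

```python
import torch


def group_aware_head_selection(
    head_importance: torch.Tensor,  # [n_heads] importance score per SSM head
    n_groups: int,                  # number of B/C groups (heads split evenly)
    keep_per_group: int,            # heads to retain in every group
) -> torch.Tensor:
    """Return indices of retained heads while preserving group structure.

    Sketch only: scores are assumed to be precomputed (e.g. activation
    magnitudes on a calibration set). The key point is that top-k selection
    happens *within* each group, so every group keeps the same number of
    heads and the SSM block's structural integrity is maintained.
    """
    n_heads = head_importance.numel()
    heads_per_group = n_heads // n_groups
    kept = []
    for g in range(n_groups):
        start = g * heads_per_group
        group_scores = head_importance[start:start + heads_per_group]
        top = torch.topk(group_scores, keep_per_group).indices + start
        kept.append(top)
    return torch.cat(kept).sort().values


# Example: 32 heads in 8 groups, keeping 2 heads per group (32 -> 16 heads).
scores = torch.rand(32)
retained = group_aware_head_selection(scores, n_groups=8, keep_per_group=2)
print(retained)  # 16 head indices, exactly 2 drawn from each group of 4
```

In a full pipeline, the retained indices would then be used to slice the corresponding SSM projection and state parameters, with FFN, embedding-dimension, and layer pruning applied analogously before the distillation-based retraining stage described above.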