Efficient Hybrid Language Model Compression through Group-Aware SSM Pruning
April 15, 2025
作者: Ali Taghibakhshi, Sharath Turuvekere Sreenivas, Saurav Muralidharan, Marcin Chochowski, Yashaswi Karnati, Raviraj Joshi, Ameya Sunil Mahabaleshwarkar, Zijia Chen, Yoshi Suhara, Oluwatobi Olabiyi, Daniel Korzekwa, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Bryan Catanzaro, Ashwath Aithal, Nima Tajbakhsh, Pavlo Molchanov
cs.AI
Abstract
Hybrid LLM architectures that combine Attention and State Space Models (SSMs)
achieve state-of-the-art accuracy and runtime performance. Recent work has
demonstrated that applying compression and distillation to Attention-only
models yields smaller, more accurate models at a fraction of the training cost.
In this work, we explore the effectiveness of compressing Hybrid architectures.
We introduce a novel group-aware pruning strategy that preserves the structural
integrity of SSM blocks and their sequence modeling capabilities. Furthermore,
we demonstrate the necessity of such SSM pruning to achieve improved accuracy
and inference speed compared to traditional approaches. Our compression recipe
combines SSM, FFN, embedding dimension, and layer pruning, followed by
knowledge distillation-based retraining, similar to the MINITRON technique.
Using this approach, we compress the Nemotron-H 8B Hybrid model down to 4B
parameters with up to 40x fewer training tokens. The resulting model surpasses
the accuracy of similarly-sized models while achieving 2x faster inference,
significantly advancing the Pareto frontier.
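The group-aware pruning idea can be illustrated with a small sketch. The code below is a minimal, hypothetical example and not the paper's implementation (the layer shapes, function name, and scoring scheme are assumptions): it scores each SSM head and keeps the top-scoring heads within every group rather than globally, so each group of heads sharing B/C projections in a Mamba-2-style block retains the same number of heads and the grouped structure survives pruning.

```python
# Minimal sketch of group-aware SSM head pruning (hypothetical names/shapes,
# not the authors' code). Assumes a Mamba-2-style layer whose H heads are
# organized into G groups that share B/C projections.
import torch


def group_aware_head_prune(head_scores: torch.Tensor, n_groups: int, keep_per_group: int):
    """Select heads to keep so that every group retains the same number of heads.

    head_scores:    (H,) importance score per SSM head (e.g., activation-based).
    n_groups:       number of B/C-sharing groups; H must be divisible by n_groups.
    keep_per_group: heads to retain in each group after pruning.
    Returns a sorted index tensor of kept heads, preserving group structure.
    """
    n_heads = head_scores.numel()
    assert n_heads % n_groups == 0, "heads must divide evenly into groups"
    heads_per_group = n_heads // n_groups

    kept = []
    for g in range(n_groups):
        start = g * heads_per_group
        group_scores = head_scores[start : start + heads_per_group]
        # Keep the top-k heads *within* this group, rather than globally,
        # so no group is pruned away entirely and the grouped B/C layout survives.
        topk = torch.topk(group_scores, keep_per_group).indices + start
        kept.append(topk)
    return torch.sort(torch.cat(kept)).values


if __name__ == "__main__":
    torch.manual_seed(0)
    scores = torch.rand(32)                                        # e.g., 32 SSM heads
    keep = group_aware_head_prune(scores, n_groups=8, keep_per_group=2)
    print(keep)                                                    # 16 heads kept, 2 per group
```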
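For the distillation-based retraining step, the sketch below shows a logit-level knowledge-distillation loss in the spirit of the MINITRON recipe the abstract refers to; the temperature, reduction, and tensor shapes are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch of logit-only knowledge distillation for retraining the pruned
# student against the original teacher (hyperparameters are assumptions).
import torch
import torch.nn.functional as F


def kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor, temperature: float = 1.0):
    """Forward KL between teacher and student next-token distributions."""
    t = temperature
    vocab = student_logits.size(-1)
    # Flatten (batch, seq, vocab) -> (batch*seq, vocab) so "batchmean" averages per token.
    log_p_student = F.log_softmax(student_logits.reshape(-1, vocab) / t, dim=-1)
    p_teacher = F.softmax(teacher_logits.reshape(-1, vocab) / t, dim=-1)
    # Scale by t^2, the usual correction when distilling with a temperature.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)


if __name__ == "__main__":
    student = torch.randn(4, 16, 1000)   # dummy (batch, seq, vocab) logits
    teacher = torch.randn(4, 16, 1000)
    print(kd_loss(student, teacher, temperature=2.0).item())
```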