Efficient Hybrid Language Model Compression through Group-Aware SSM Pruning
April 15, 2025
作者: Ali Taghibakhshi, Sharath Turuvekere Sreenivas, Saurav Muralidharan, Marcin Chochowski, Yashaswi Karnati, Raviraj Joshi, Ameya Sunil Mahabaleshwarkar, Zijia Chen, Yoshi Suhara, Oluwatobi Olabiyi, Daniel Korzekwa, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Bryan Catanzaro, Ashwath Aithal, Nima Tajbakhsh, Pavlo Molchanov
cs.AI
Abstract
Hybrid LLM architectures that combine Attention and State Space Models (SSMs)
achieve state-of-the-art accuracy and runtime performance. Recent work has
demonstrated that applying compression and distillation to Attention-only
models yields smaller, more accurate models at a fraction of the training cost.
In this work, we explore the effectiveness of compressing Hybrid architectures.
We introduce a novel group-aware pruning strategy that preserves the structural
integrity of SSM blocks and their sequence modeling capabilities. Furthermore,
we demonstrate the necessity of such SSM pruning to achieve improved accuracy
and inference speed compared to traditional approaches. Our compression recipe
combines SSM, FFN, embedding dimension, and layer pruning, followed by
knowledge distillation-based retraining, similar to the MINITRON technique.
Using this approach, we compress the Nemotron-H 8B Hybrid model down to 4B
parameters with up to 40x fewer training tokens. The resulting model surpasses
the accuracy of similarly-sized models while achieving 2x faster inference,
significantly advancing the Pareto frontier.
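The group-aware pruning idea can be illustrated with a small sketch. The code below is a minimal, hypothetical example and not the paper's implementation (the layer shapes, function name, and scoring scheme are assumptions): it scores each SSM head and keeps the top-scoring heads within every group rather than globally, so each group of heads sharing B/C projections in a Mamba-2-style block retains the same number of heads and the grouped structure survives pruning.

```python
# Minimal sketch of group-aware SSM head pruning (hypothetical names/shapes,
# not the authors' code). Assumes a Mamba-2-style layer whose H heads are
# organized into G groups that share B/C projections.
import torch


def group_aware_head_prune(head_scores: torch.Tensor, n_groups: int, keep_per_group: int):
    """Select heads to keep so that every group retains the same number of heads.

    head_scores:    (H,) importance score per SSM head (e.g., activation-based).
    n_groups:       number of B/C-sharing groups; H must be divisible by n_groups.
    keep_per_group: heads to retain in each group after pruning.
    Returns a sorted index tensor of kept heads, preserving group structure.
    """
    n_heads = head_scores.numel()
    assert n_heads % n_groups == 0, "heads must divide evenly into groups"
    heads_per_group = n_heads // n_groups

    kept = []
    for g in range(n_groups):
        start = g * heads_per_group
        group_scores = head_scores[start : start + heads_per_group]
        # Keep the top-k heads *within* this group, rather than globally,
        # so no group is pruned away entirely and the grouped B/C layout survives.
        topk = torch.topk(group_scores, keep_per_group).indices + start
        kept.append(topk)
    return torch.sort(torch.cat(kept)).values


if __name__ == "__main__":
    torch.manual_seed(0)
    scores = torch.rand(32)                                        # e.g., 32 SSM heads
    keep = group_aware_head_prune(scores, n_groups=8, keep_per_group=2)
    print(keep)                                                    # 16 heads kept, 2 per group
```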
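For the distillation-based retraining step, the sketch below shows a logit-level knowledge-distillation loss in the spirit of the MINITRON recipe the abstract refers to; the temperature, reduction, and tensor shapes are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch of logit-only knowledge distillation for retraining the pruned
# student against the original teacher (hyperparameters are assumptions).
import torch
import torch.nn.functional as F


def kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor, temperature: float = 1.0):
    """Forward KL between teacher and student next-token distributions."""
    t = temperature
    vocab = student_logits.size(-1)
    # Flatten (batch, seq, vocab) -> (batch*seq, vocab) so "batchmean" averages per token.
    log_p_student = F.log_softmax(student_logits.reshape(-1, vocab) / t, dim=-1)
    p_teacher = F.softmax(teacher_logits.reshape(-1, vocab) / t, dim=-1)
    # Scale by t^2, the usual correction when distilling with a temperature.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)


if __name__ == "__main__":
    student = torch.randn(4, 16, 1000)   # dummy (batch, seq, vocab) logits
    teacher = torch.randn(4, 16, 1000)
    print(kd_loss(student, teacher, temperature=2.0).item())
```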