TransMamba: Flexibly Switching between Transformer and Mamba
March 31, 2025
Authors: Yixing Li, Ruobing Xie, Zhen Yang, Xingwu Sun, Shuaipeng Li, Weidong Han, Zhanhui Kang, Yu Cheng, Chengzhong Xu, Di Wang, Jie Jiang
cs.AI
Abstract
Transformers are the cornerstone of modern large language models, but their quadratic computational complexity limits efficiency in long-sequence processing. Recent advancements in Mamba, a state space model (SSM) with linear complexity, offer promising efficiency gains but suffer from instability in contextual learning and multitask generalization. This paper proposes TransMamba, a novel framework that unifies Transformer and Mamba through shared parameter matrices (e.g., QKV and CBx), and can thus dynamically switch between attention and SSM mechanisms at different token lengths and layers. We design a Memory converter to bridge Transformer and Mamba by converting attention outputs into SSM-compatible states, ensuring seamless information flow at the TransPoints where the transformation happens. TransPoint scheduling is also thoroughly explored for further improvements. Extensive experiments demonstrate that TransMamba achieves superior training efficiency and performance compared to baselines and validate a deeper consistency between the Transformer and Mamba paradigms, offering a scalable solution for next-generation sequence modeling.
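
To make the mechanism described in the abstract concrete, below is a minimal, single-head PyTorch sketch of the switching idea: the same projection weights serve as Q/K/V for the attention segment and as C/B/x-like inputs for the SSM segment, and a simple state-folding step stands in for the paper's Memory converter at the TransPoint. All names here (TransMambaLayerSketch, memory_converter, trans_point) and the exact state-conversion and decay rules are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of Transformer/Mamba switching with shared projections.
# Hypothetical names and conversion rule; not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TransMambaLayerSketch(nn.Module):
    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        # Shared projections: reused as Q/K/V for the attention segment and as
        # C/B/x-like inputs for the SSM segment (the "QKV and CBx" sharing).
        self.q_c = nn.Linear(d_model, d_state, bias=False)
        self.k_b = nn.Linear(d_model, d_state, bias=False)
        self.v_x = nn.Linear(d_model, d_model, bias=False)
        self.delta = nn.Linear(d_model, 1, bias=True)  # per-token decay (SSM step size)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def memory_converter(self, k, v):
        # Illustrative stand-in for the paper's Memory converter: fold the
        # attention segment's keys/values into an SSM-style state of shape
        # (batch, d_state, d_model) so the recurrent scan can continue from it.
        return torch.einsum("btn,btd->bnd", k, v)

    def forward(self, x, trans_point: int):
        q, k, v = self.q_c(x), self.k_b(x), self.v_x(x)
        pre, post = slice(0, trans_point), slice(trans_point, x.size(1))

        # 1) Causal attention over tokens before the TransPoint.
        attn = F.scaled_dot_product_attention(
            q[:, pre], k[:, pre], v[:, pre], is_causal=True)

        # 2) Convert the attention segment's memory into an initial SSM state.
        state = self.memory_converter(k[:, pre], v[:, pre])

        # 3) Linear-time recurrent scan over tokens after the TransPoint,
        #    reusing the same projections as B (input) and C (output) matrices.
        decay = torch.sigmoid(self.delta(x[:, post]))            # (B, T2, 1)
        outs = []
        for t in range(x.size(1) - trans_point):
            a_t = decay[:, t].unsqueeze(-1)                      # (B, 1, 1)
            state = a_t * state + torch.einsum(
                "bn,bd->bnd", k[:, trans_point + t], v[:, trans_point + t])
            outs.append(torch.einsum("bn,bnd->bd", q[:, trans_point + t], state))
        ssm_out = torch.stack(outs, dim=1) if outs else x[:, post]

        return self.out(torch.cat([attn, ssm_out], dim=1))
```

As a usage example under the same assumptions, `TransMambaLayerSketch(512)(torch.randn(2, 128, 512), trans_point=96)` processes the first 96 tokens with attention and the remaining 32 with the linear-time scan, seeded from the converted attention memory.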