TransMamba: Flexibly Switching between Transformer and Mamba

March 31, 2025
Authors: Yixing Li, Ruobing Xie, Zhen Yang, Xingwu Sun, Shuaipeng Li, Weidong Han, Zhanhui Kang, Yu Cheng, Chengzhong Xu, Di Wang, Jie Jiang
cs.AI

Abstract

Transformers are the cornerstone of modern large language models, but their quadratic computational complexity limits efficiency in long-sequence processing. Mamba, a state space model (SSM) with linear complexity, has recently shown promising efficiency gains, yet it still struggles with the stability of contextual learning and with multitask generalization. This paper proposes TransMamba, a novel framework that unifies Transformer and Mamba through shared parameter matrices (e.g., QKV and CBx), allowing it to dynamically switch between attention and SSM mechanisms across different token lengths and layers. We design the Memory converter to bridge Transformer and Mamba by converting attention outputs into SSM-compatible states, ensuring seamless information flow at the TransPoints where the transformation occurs. TransPoint scheduling is also explored in depth for further improvement. Extensive experiments demonstrate that TransMamba achieves superior training efficiency and performance compared to baselines and validate a deeper consistency between the Transformer and Mamba paradigms, offering a scalable solution for next-generation sequence modeling.
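
To make the mechanism described in the abstract more concrete, the sketch below shows one way the core idea could look in code: a single layer that shares one set of projection weights between an attention branch (Q, K, V) and a simplified SSM branch (C, B, x), runs causal attention for tokens before a chosen TransPoint, converts the attention prefix into an SSM-style state, and continues with a linear recurrence from that point on. This is a minimal illustration based only on the abstract; the Q/C, K/B, V/x pairing, the toy recurrence, and the state-conversion rule are assumptions, not the paper's exact Memory converter or scheduling strategy.

```python
# Illustrative-only sketch of the TransMamba idea from the abstract.
# Assumed (not from the paper): the Q<->C, K<->B, V<->x weight pairing,
# the toy decay-based recurrence, and the prefix-to-state conversion rule.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyTransMambaLayer(nn.Module):
    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        # Shared projections, reused as Q/C, K/B, V/x depending on the branch.
        self.proj_qc = nn.Linear(d_model, d_state, bias=False)  # Q (attention) / C (SSM readout)
        self.proj_kb = nn.Linear(d_model, d_state, bias=False)  # K (attention) / B (SSM input gate)
        self.proj_vx = nn.Linear(d_model, d_model, bias=False)  # V (attention) / x (SSM input)
        # Learnable per-channel decay for the toy recurrence (stands in for A).
        self.log_decay = nn.Parameter(torch.zeros(d_state))
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, h: torch.Tensor, trans_point: int) -> torch.Tensor:
        """h: (batch, seq_len, d_model); trans_point: index where attention hands off to the SSM."""
        T = h.size(1)
        q_or_c = self.proj_qc(h)   # (B, T, d_state)
        k_or_b = self.proj_kb(h)   # (B, T, d_state)
        v_or_x = self.proj_vx(h)   # (B, T, d_model)

        # Attention branch: causal softmax attention over tokens before the TransPoint.
        qa, ka, va = q_or_c[:, :trans_point], k_or_b[:, :trans_point], v_or_x[:, :trans_point]
        scores = qa @ ka.transpose(-1, -2) / ka.shape[-1] ** 0.5
        mask = torch.triu(torch.ones(trans_point, trans_point, dtype=torch.bool, device=h.device), 1)
        attn_out = F.softmax(scores.masked_fill(mask, float("-inf")), dim=-1) @ va

        # "Memory conversion": summarize the attention outputs into an SSM-style state
        # so the recurrence can continue from it (a stand-in for the Memory converter).
        state = torch.einsum("btn,btd->bnd", ka, attn_out)        # (B, d_state, d_model)
        decay = torch.exp(-F.softplus(self.log_decay))            # (d_state,)

        # SSM branch: toy linear recurrence for tokens from the TransPoint onward,
        #   state_t = decay * state_{t-1} + B_t^T x_t ;  y_t = C_t state_t
        ys = []
        for t in range(trans_point, T):
            state = decay.unsqueeze(-1) * state \
                + k_or_b[:, t].unsqueeze(-1) * v_or_x[:, t].unsqueeze(1)
            ys.append(torch.einsum("bn,bnd->bd", q_or_c[:, t], state))
        ssm_out = torch.stack(ys, dim=1) if ys else h[:, :0]

        return self.out(torch.cat([attn_out, ssm_out], dim=1))


if __name__ == "__main__":
    layer = ToyTransMambaLayer(d_model=64)
    x = torch.randn(2, 32, 64)
    print(layer(x, trans_point=16).shape)  # torch.Size([2, 32, 64])
```

Because the Q/K/V and C/B/x projections share weights, the switch point can in principle be varied per sequence length or per layer without duplicating parameters, which is the flexibility the abstract emphasizes.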
