Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity

Abstract

State Space Models (SSMs) have emerged as efficient alternatives to Transformers for sequential modeling, but their inability to leverage modality-specific features limits their performance in multi-modal pretraining. Here, we propose Mixture-of-Mamba, a novel SSM architecture that introduces modality-aware sparsity through modality-specific parameterization of the Mamba block. Building on Mixture-of-Transformers (W. Liang et al., arXiv:2411.04996, 2024), we extend the benefits of modality-aware sparsity to SSMs while preserving their computational efficiency. We evaluate Mixture-of-Mamba across three multi-modal pretraining settings: Transfusion (interleaved text and continuous image tokens with diffusion loss), Chameleon (interleaved text and discrete image tokens), and an extended three-modality framework incorporating speech. Mixture-of-Mamba consistently reaches the same loss values at earlier training steps, at significantly reduced computational cost. In the Transfusion setting, Mixture-of-Mamba achieves equivalent image loss using only 34.76% of the training FLOPs at the 1.4B scale. In the Chameleon setting, Mixture-of-Mamba reaches similar image loss with just 42.50% of the FLOPs at the 1.4B scale, and similar text loss with just 65.40% of the FLOPs. In the three-modality setting, Mixture-of-Mamba matches speech loss using only 24.80% of the FLOPs at the 1.4B scale. Our ablation study highlights the synergistic effects of decoupling projection components: joint decoupling yields greater gains than individual modifications. These results establish modality-aware sparsity as a versatile and effective design principle, extending its impact from Transformers to SSMs and setting new benchmarks in multi-modal pretraining. Our code can be accessed at https://github.com/Weixin-Liang/Mixture-of-Mamba.
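
The abstract does not spell out how modality-specific parameterization works, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that every token carries a known modality tag and that a projection keeps one weight matrix per modality, so each token touches only its own modality's parameters. The names `modality_aware_projection`, `matvec`, and the toy weights are hypothetical, and the state-space sequence mixing of a real Mamba block is not modeled here.

```python
from typing import Dict, List, Sequence

Matrix = List[List[float]]


def matvec(weight: Matrix, x: Sequence[float]) -> List[float]:
    """Multiply a (d_out x d_in) matrix by a length-d_in vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def modality_aware_projection(
    tokens: Sequence[Sequence[float]],
    modalities: Sequence[str],
    weights: Dict[str, Matrix],
) -> List[List[float]]:
    """Project each token with the weight matrix of its own modality.

    Assumption of this sketch: only one modality's parameters are used
    per token, which is the sense of "sparsity" illustrated here.
    """
    if len(tokens) != len(modalities):
        raise ValueError("need exactly one modality tag per token")
    return [matvec(weights[m], tok) for tok, m in zip(tokens, modalities)]


# Hypothetical toy setup: 2-dimensional tokens and two modalities.
weights = {
    "text": [[1.0, 0.0], [0.0, 1.0]],   # identity map for text tokens
    "image": [[2.0, 0.0], [0.0, 2.0]],  # doubling map for image tokens
}
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
modalities = ["text", "image", "text"]

projected = modality_aware_projection(tokens, modalities, weights)
print(projected)  # [[1.0, 2.0], [6.0, 8.0], [5.0, 6.0]]
```

In this toy interleaved sequence, the text tokens pass through unchanged while the image token is doubled, which shows that the routing depends only on the modality tag and not on token position. Which projections of the Mamba block are decoupled in this way is described in the paper and code repository, not in the abstract.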
