Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity

Abstract

State Space Models (SSMs) have emerged as efficient alternatives to Transformers for sequential modeling, but their inability to leverage modality-specific features limits their performance in multi-modal pretraining. Here, we propose Mixture-of-Mamba, a novel SSM architecture that introduces modality-aware sparsity through modality-specific parameterization of the Mamba block. Building on Mixture-of-Transformers (W. Liang et al. arXiv:2411.04996; 2024), we extend the benefits of modality-aware sparsity to SSMs while preserving their computational efficiency. We evaluate Mixture-of-Mamba across three multi-modal pretraining settings: Transfusion (interleaved text and continuous image tokens with diffusion loss), Chameleon (interleaved text and discrete image tokens), and an extended three-modality framework incorporating speech. Mixture-of-Mamba consistently reaches the same loss values at earlier training steps with significantly reduced computational costs. In the Transfusion setting, Mixture-of-Mamba achieves equivalent image loss using only 34.76% of the training FLOPs at the 1.4B scale. In the Chameleon setting, Mixture-of-Mamba reaches similar image loss with just 42.50% of the FLOPs at the 1.4B scale, and similar text loss with just 65.40% of the FLOPs. In the three-modality setting, Mixture-of-Mamba matches speech loss at 24.80% of the FLOPs at the 1.4B scale. Our ablation study highlights the synergistic effects of decoupling projection components, where joint decoupling yields greater gains than individual modifications. These results establish modality-aware sparsity as a versatile and effective design principle, extending its impact from Transformers to SSMs and setting new benchmarks in multi-modal pretraining. Our code is available at https://github.com/Weixin-Liang/Mixture-of-Mamba
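
To make "modality-specific parameterization" concrete, here is a minimal sketch of a projection that keeps one weight matrix per modality and routes each token to the matrix for its own modality, so that only that modality's parameters are active for the token. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which Mamba projections are decoupled, and the names `ModalityAwareProjection`, `make_weights`, and `matvec`, the dimensions, and the random initialization are all hypothetical.

```python
import random


def make_weights(d_in, d_out, rng):
    """Return a d_out x d_in weight matrix with small random entries (hypothetical init)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]


def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


class ModalityAwareProjection:
    """Hypothetical sketch: one weight matrix per modality.

    Each token is projected with the matrix of its own modality only,
    so the per-token compute matches a single dense projection.
    """

    def __init__(self, modalities, d_in, d_out, seed=0):
        rng = random.Random(seed)
        self.weights = {m: make_weights(d_in, d_out, rng) for m in modalities}

    def __call__(self, tokens, modality_ids):
        # tokens: list of d_in-dimensional vectors
        # modality_ids: one modality label per token, in the same order
        return [matvec(self.weights[m], x) for x, m in zip(tokens, modality_ids)]


# Usage: an interleaved sequence of text and image tokens.
proj = ModalityAwareProjection(["text", "image"], d_in=4, d_out=3)
tokens = [[1.0, 0.0, 0.0, 0.0]] * 3
out = proj(tokens, ["text", "image", "text"])
```

In this sketch the two text tokens share weights and therefore produce identical outputs for identical inputs, while the image token is projected with separate parameters. The sequence order is preserved, which is what an interleaved multi-modal sequence would need before any state-space scan is applied.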
