Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation

Abstract

Recent Multimodal Large Language Models (MLLMs) have achieved remarkable performance but face deployment challenges due to their quadratic computational complexity, growing Key-Value cache requirements, and reliance on separate vision encoders. We propose mmMamba, a framework for developing linear-complexity native multimodal state space models through progressive distillation from existing MLLMs using moderate academic computational resources. Our approach enables the direct conversion of trained decoder-only MLLMs to linear-complexity architectures without requiring pre-trained RNN-based LLMs or vision encoders. We propose a seeding strategy to carve Mamba from a trained Transformer and a three-stage distillation recipe that effectively transfers knowledge from the Transformer to Mamba while preserving multimodal capabilities. Our method also supports flexible hybrid architectures that combine Transformer and Mamba layers for customizable efficiency-performance trade-offs. Distilled from the Transformer-based decoder-only HoVLE, mmMamba-linear achieves competitive performance against existing linear and quadratic-complexity VLMs, while mmMamba-hybrid further improves performance significantly, approaching HoVLE's capabilities. At 103K tokens, mmMamba-linear demonstrates a 20.6× speedup and a 75.8% GPU memory reduction compared to HoVLE, while mmMamba-hybrid achieves a 13.5× speedup and 60.2% memory savings. Code and models are released at https://github.com/hustvl/mmMamba.
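
To make the Key-Value cache argument concrete, the following is a minimal back-of-the-envelope sketch, not the authors' implementation. It counts the decoding-time state a model must keep: a KV cache grows with the number of tokens, a recurrent state of the kind used by state space layers does not, and a hybrid stack sits in between. The function names and the layer, head, and state dimensions are all hypothetical and are not taken from HoVLE or mmMamba. The count also ignores weights and activations, so it is not expected to reproduce the memory figures reported above.

```python
def kv_cache_elements(num_tokens, num_layers, num_heads, head_dim):
    """Elements held by a KV cache: one key and one value vector
    per token, per head, per attention layer."""
    return 2 * num_tokens * num_layers * num_heads * head_dim


def recurrent_state_elements(num_layers, num_heads, head_dim, state_dim):
    """Elements held by a fixed-size recurrent state.
    Note that the sequence length does not appear here."""
    return num_layers * num_heads * head_dim * state_dim


def hybrid_elements(num_tokens, attn_layers, ssm_layers,
                    num_heads, head_dim, state_dim):
    """Hybrid stack: attention layers keep a KV cache,
    the remaining layers keep a recurrent state."""
    return (kv_cache_elements(num_tokens, attn_layers, num_heads, head_dim)
            + recurrent_state_elements(ssm_layers, num_heads, head_dim, state_dim))


if __name__ == "__main__":
    # Hypothetical configuration, chosen only for illustration.
    LAYERS, HEADS, HEAD_DIM, STATE_DIM = 32, 16, 128, 128
    ATTN_LAYERS = 8  # assumed number of Transformer layers kept in the hybrid

    for tokens in (1_000, 10_000, 100_000):
        quadratic = kv_cache_elements(tokens, LAYERS, HEADS, HEAD_DIM)
        linear = recurrent_state_elements(LAYERS, HEADS, HEAD_DIM, STATE_DIM)
        hybrid = hybrid_elements(tokens, ATTN_LAYERS, LAYERS - ATTN_LAYERS,
                                 HEADS, HEAD_DIM, STATE_DIM)
        print(f"{tokens:>7} tokens | kv cache: {quadratic:>13,} "
              f"| recurrent: {linear:>11,} | hybrid: {hybrid:>13,}")
```

Under these assumptions the KV cache column scales linearly with the token count, while the recurrent column stays constant. Changing the hypothetical `ATTN_LAYERS` value moves the hybrid between the two, which is the kind of efficiency-performance trade-off the abstract describes.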
