OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models
March 11, 2025
Authors: Jialv Zou, Bencheng Liao, Qian Zhang, Wenyu Liu, Xinggang Wang
cs.AI
Abstract
Recent advancements in unified multimodal understanding and visual generation
(or multimodal generation) models have been hindered by their quadratic
computational complexity and dependence on large-scale training data. We
present OmniMamba, the first linear-architecture-based multimodal generation
model that generates both text and images through a unified next-token
prediction paradigm. The model fully leverages Mamba-2's high computational and
memory efficiency, extending its capabilities from text generation to
multimodal generation. To address the data inefficiency of existing unified
models, we propose two key innovations: (1) decoupled vocabularies to guide
modality-specific generation, and (2) task-specific LoRA for
parameter-efficient adaptation. Furthermore, we introduce a decoupled two-stage
training strategy to mitigate the data imbalance between the two tasks. Equipped with
these techniques, OmniMamba achieves competitive performance with JanusFlow
while surpassing Show-o across benchmarks, despite being trained on merely 2M
image-text pairs, which is 1,000 times fewer than Show-o. Notably, OmniMamba
stands out with outstanding inference efficiency, achieving up to a 119.2 times
speedup and 63% GPU memory reduction for long-sequence generation compared to
Transformer-based counterparts. Code and models are released at
https://github.com/hustvl/OmniMamba.
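To make the two innovations named in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of decoupled vocabularies (separate text and image-token output heads over a shared hidden state) together with task-specific LoRA (a low-rank delta per task added to a shared projection). The module names (`TaskLoRA`, `DecoupledHeads`), vocabulary sizes, LoRA rank, and the plain linear layer standing in for a Mamba-2 block are illustrative assumptions, not the released OmniMamba implementation.

```python
# Minimal sketch (not the official OmniMamba code) of decoupled vocabularies
# and task-specific LoRA routing over a shared backbone projection.
# All sizes and names below are illustrative assumptions.
import torch
import torch.nn as nn

class TaskLoRA(nn.Module):
    """Low-rank adapter added to a shared linear layer for one task."""
    def __init__(self, dim, rank=16):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # adapter starts as a zero delta

    def forward(self, x):
        return self.up(self.down(x))

class DecoupledHeads(nn.Module):
    """Shared hidden states projected into modality-specific vocabularies."""
    def __init__(self, dim, text_vocab=32000, image_vocab=8192):
        super().__init__()
        self.shared_proj = nn.Linear(dim, dim)         # stands in for a Mamba-2 block output
        self.lora = nn.ModuleDict({
            "und": TaskLoRA(dim),                      # multimodal understanding task
            "gen": TaskLoRA(dim),                      # visual generation task
        })
        self.text_head = nn.Linear(dim, text_vocab)    # next text token
        self.image_head = nn.Linear(dim, image_vocab)  # next discrete image token

    def forward(self, h, task):
        h = self.shared_proj(h) + self.lora[task](h)   # task-specific low-rank delta
        return self.text_head(h) if task == "und" else self.image_head(h)

heads = DecoupledHeads(dim=512)
hidden = torch.randn(1, 10, 512)              # dummy hidden states from the backbone
text_logits = heads(hidden, task="und")       # shape (1, 10, 32000)
image_logits = heads(hidden, task="gen")      # shape (1, 10, 8192)
```

The sketch only illustrates the routing idea: text and discrete image tokens live in separate vocabularies, while small per-task LoRA branches specialize a shared backbone without duplicating its parameters; it does not model Mamba-2 itself or the two-stage training schedule.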