Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation
April 4, 2025
作者: Xin Zhang, Robby T. Tan
cs.AI
Abstract
Vision Foundation Models (VFMs) and Vision-Language Models (VLMs) have gained
traction in Domain Generalized Semantic Segmentation (DGSS) due to their strong
generalization capabilities. However, existing DGSS methods often rely
exclusively on either VFMs or VLMs, overlooking their complementary strengths.
VFMs (e.g., DINOv2) excel at capturing fine-grained features, while VLMs (e.g.,
CLIP) provide robust text alignment but struggle with coarse granularity.
Despite their complementary strengths, effectively integrating VFMs and VLMs
with attention mechanisms is challenging, as the increased patch tokens
complicate long-sequence modeling. To address this, we propose MFuser, a novel
Mamba-based fusion framework that efficiently combines the strengths of VFMs
and VLMs while maintaining linear scalability in sequence length. MFuser
consists of two key components: MVFuser, which acts as a co-adapter to jointly
fine-tune the two models by capturing both sequential and spatial dynamics; and
MTEnhancer, a hybrid attention-Mamba module that refines text embeddings by
incorporating image priors. Our approach achieves precise feature locality and
strong text alignment without incurring significant computational overhead.
Extensive experiments demonstrate that MFuser significantly outperforms
state-of-the-art DGSS methods, achieving 68.20 mIoU on synthetic-to-real and
71.87 mIoU on real-to-real benchmarks. The code is available at
https://github.com/devinxzhang/MFuser.
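The abstract's key efficiency claim is that a Mamba-style recurrent scan over the concatenated VFM and VLM patch tokens costs O(L) in the combined sequence length L, whereas full self-attention over the same sequence costs O(L^2). The sketch below is purely illustrative (it is not the authors' MVFuser code): it fuses two hypothetical token streams with a single simplified linear-time state-space scan; the dimensions, the `decay` parameter, and the residual gating are all assumptions for demonstration.

```python
# Illustrative sketch only: linear-time fusion of VFM and VLM tokens via a
# simplified state-space scan, in the spirit of Mamba. Not the MFuser code;
# all shapes and the fixed `decay` gate are hypothetical.
import numpy as np

def linear_scan_fuse(vfm_tokens, vlm_tokens, decay=0.9):
    """Concatenate the two token streams and mix them with one recurrent scan.

    Cost is O(L * d) in the combined sequence length L, unlike the O(L^2)
    cost of self-attention over the same concatenated sequence.
    """
    x = np.concatenate([vfm_tokens, vlm_tokens], axis=0)  # (L, d)
    h = np.zeros(x.shape[1])                              # running state, (d,)
    out = np.empty_like(x)
    for t, xt in enumerate(x):       # one state update per token: linear in L
        h = decay * h + (1.0 - decay) * xt
        out[t] = xt + h              # residual mix of token and running state
    return out

# Toy usage: 16 VFM tokens and 16 VLM tokens of dimension 8.
rng = np.random.default_rng(0)
fused = linear_scan_fuse(rng.standard_normal((16, 8)),
                         rng.standard_normal((16, 8)))
print(fused.shape)  # (32, 8)
```

A real selective-scan block additionally makes the state transition input-dependent and runs the recurrence as a parallel scan on GPU, but the linear-in-L cost structure shown here is the property the paper relies on to make joint VFM+VLM fine-tuning tractable.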