
Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation

April 4, 2025
作者: Xin Zhang, Robby T. Tan
cs.AI

Abstract

Vision Foundation Models (VFMs) and Vision-Language Models (VLMs) have gained traction in Domain Generalized Semantic Segmentation (DGSS) due to their strong generalization capabilities. However, existing DGSS methods often rely exclusively on either VFMs or VLMs, overlooking their complementary strengths. VFMs (e.g., DINOv2) excel at capturing fine-grained features, while VLMs (e.g., CLIP) provide robust text alignment but struggle with coarse granularity. Despite their complementary strengths, effectively integrating VFMs and VLMs with attention mechanisms is challenging, as the increased patch tokens complicate long-sequence modeling. To address this, we propose MFuser, a novel Mamba-based fusion framework that efficiently combines the strengths of VFMs and VLMs while maintaining linear scalability in sequence length. MFuser consists of two key components: MVFuser, which acts as a co-adapter to jointly fine-tune the two models by capturing both sequential and spatial dynamics; and MTEnhancer, a hybrid attention-Mamba module that refines text embeddings by incorporating image priors. Our approach achieves precise feature locality and strong text alignment without incurring significant computational overhead. Extensive experiments demonstrate that MFuser significantly outperforms state-of-the-art DGSS methods, achieving 68.20 mIoU on synthetic-to-real and 71.87 mIoU on real-to-real benchmarks. The code is available at https://github.com/devinxzhang/MFuser.
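To make the abstract's core idea concrete, below is a minimal toy sketch of fusing patch tokens from two frozen encoders with a linear-time sequence mixer. This is not the authors' implementation: `linear_scan` is a crude stand-in for a Mamba selective-scan block, and `mvfuser`, the token shapes, and all parameter names are hypothetical illustrations of the "concatenate both encoders' tokens, mix in linear time, split back" pattern.

```python
import numpy as np

def linear_scan(x, decay=0.9):
    # Simplified linear-time recurrence standing in for a Mamba
    # selective-scan block: h_t = decay * h_{t-1} + x_t.
    # Cost is O(sequence length), unlike quadratic self-attention.
    h = np.zeros(x.shape[-1])
    out = np.empty_like(x)
    for t, xt in enumerate(x):
        h = decay * h + xt
        out[t] = h
    return out

def mvfuser(vfm_tokens, vlm_tokens):
    # Hypothetical co-adapter: concatenate patch tokens from both
    # encoders along the sequence axis, mix them jointly with one
    # linear-time scan, then split back into per-encoder residual
    # updates so each branch keeps its own representation.
    fused = np.concatenate([vfm_tokens, vlm_tokens], axis=0)
    mixed = linear_scan(fused)
    n = vfm_tokens.shape[0]
    return vfm_tokens + mixed[:n], vlm_tokens + mixed[n:]

vfm = np.random.randn(196, 64)  # e.g. DINOv2 patch tokens (14x14 grid)
vlm = np.random.randn(196, 64)  # e.g. CLIP patch tokens
v_out, l_out = mvfuser(vfm, vlm)
print(v_out.shape, l_out.shape)  # (196, 64) (196, 64)
```

The point of the sketch is the scaling argument from the abstract: doubling the token count (two encoders instead of one) doubles the cost of the scan, whereas it would quadruple the cost of full self-attention over the concatenated sequence.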


April 8, 2025