Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection
September 13, 2024
Authors: Haoxuan Wang, Qingdong He, Jinlong Peng, Hao Yang, Mingmin Chi, Yabiao Wang
cs.AI
Abstract
Open-vocabulary detection (OVD) aims to detect objects beyond a predefined
set of categories. As a pioneering model incorporating the YOLO series into
OVD, YOLO-World is well-suited for scenarios prioritizing speed and
efficiency. However, its performance is hindered by its neck feature fusion
mechanism, which incurs quadratic complexity and provides only limited guided
receptive fields. To address these limitations, we present Mamba-YOLO-World, a
novel YOLO-based OVD model employing the proposed MambaFusion Path Aggregation
Network (MambaFusion-PAN) as its neck architecture. Specifically, we introduce
an innovative State Space Model-based feature fusion mechanism consisting of a
Parallel-Guided Selective Scan algorithm and a Serial-Guided Selective Scan
algorithm with linear complexity and globally guided receptive fields. It
leverages multi-modal input sequences and Mamba hidden states to guide the
selective scanning process. Experiments demonstrate that our model outperforms
the original YOLO-World on the COCO and LVIS benchmarks in both zero-shot and
fine-tuning settings while maintaining comparable parameters and FLOPs.
Additionally, it surpasses existing state-of-the-art OVD methods with fewer
parameters and FLOPs.
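
To make the "linear complexity" claim concrete, here is a minimal sketch of a generic selective state space scan in NumPy. It is not the authors' MambaFusion-PAN, and the abstract does not specify how the Parallel-Guided and Serial-Guided variants are built. The function name `selective_scan`, its parameters, and the optional initial state `h0` are assumptions made for illustration. The point is only that one pass over a length-L sequence costs O(L), and that a hidden state carried in from elsewhere can influence every later output.

```python
import numpy as np

def selective_scan(x, delta, A, B, C, h0=None):
    """Generic single-channel selective SSM scan (illustrative only).

    x:     (L,)   input sequence
    delta: (L,)   input-dependent step sizes
    A:     (N,)   diagonal state matrix (negative entries give decay)
    B, C:  (L, N) input-dependent input/output projections
    h0:    (N,)   optional initial hidden state (hypothetical guidance hook)
    """
    L, N = B.shape
    h = np.zeros(N) if h0 is None else np.asarray(h0, dtype=float).copy()
    y = np.empty(L)
    for t in range(L):  # one step per token: O(L) overall
        a_bar = np.exp(delta[t] * A)            # discretized state transition
        h = a_bar * h + delta[t] * B[t] * x[t]  # state update
        y[t] = C[t] @ h                         # readout
    return y, h

# Tiny demo: an impulse input decays geometrically through the state.
L, N = 3, 2
A = -np.ones(N)
ones = np.ones((L, N))
y, h_last = selective_scan(np.array([1.0, 0.0, 0.0]), np.ones(L), A, ones, ones)

# A non-zero initial state changes the output even when the input is all zeros.
y_guided, _ = selective_scan(np.zeros(L), np.ones(L), A, ones, ones, h0=np.ones(N))
```

Because the state `h` has a fixed size, the cost per token does not grow with sequence length, in contrast to pairwise attention-style fusion whose cost grows quadratically.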