
EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality

November 22, 2024
Authors: Sanghyeok Lee, Joonmyung Choi, Hyunwoo J. Kim
cs.AI

Abstract
For the deployment of neural networks in resource-constrained environments, prior works have built lightweight architectures with convolution and attention for capturing local and global dependencies, respectively. Recently, the state space model (SSM) has emerged as an effective mechanism for global token interaction, with a favorable linear computational cost in the number of tokens. Yet, efficient vision backbones built with SSMs remain underexplored. In this paper, we introduce Efficient Vision Mamba (EfficientViM), a novel architecture built on hidden state mixer-based state space duality (HSM-SSD) that efficiently captures global dependencies with further reduced computational cost. In the HSM-SSD layer, we redesign the previous SSD layer to enable the channel-mixing operation within hidden states. Additionally, we propose multi-stage hidden state fusion to further reinforce the representation power of hidden states, and provide a design that alleviates the bottleneck caused by memory-bound operations. As a result, the EfficientViM family achieves a new state-of-the-art speed-accuracy trade-off on ImageNet-1k, offering up to a 0.7% performance improvement over the second-best model, SHViT, at faster speed. Furthermore, we observe significant improvements in throughput and accuracy compared to prior works when scaling up images or employing distillation training. Code is available at https://github.com/mlvlab/EfficientViM.
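To make the core idea concrete, the following is a minimal, unofficial NumPy sketch of what "channel mixing within hidden states" can mean in an SSD-style layer. It is an assumption-laden simplification, not the authors' implementation: the per-step selective/decay terms of SSD are omitted, and the function name `hsm_ssd_sketch` and all shapes are hypothetical. The point it illustrates is that applying a channel-mixing projection `W` to the small hidden state (N x d, with N much smaller than the sequence length L) is mathematically equivalent, by associativity, to applying it to the full L x d output, at a fraction of the cost.

```python
import numpy as np

def hsm_ssd_sketch(x, B, C, W):
    """Illustrative (non-official) sketch of the HSM-SSD idea.

    x: (L, d) input tokens
    B: (L, N) input gates   (N << L)
    C: (L, N) output gates
    W: (d, d) channel-mixing weights

    Channel mixing is applied to the (N, d) hidden state,
    costing O(N * d^2) instead of O(L * d^2) on the sequence.
    """
    # 1) Aggregate tokens into an N x d hidden state (linear in L).
    H = B.T @ x          # (N, d)
    # 2) Channel mixing inside the hidden state (the key move).
    H = H @ W            # (N, d)
    # 3) Read out per-token features from the mixed hidden state.
    y = C @ H            # (L, d), again linear in L
    return y
```

By associativity, `C @ ((B.T @ x) @ W)` equals `(C @ (B.T @ x)) @ W`, so the cheap hidden-state mixing yields the same result as mixing the full sequence output would.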
