Efficient Track Anything
November 28, 2024
Authors: Yunyang Xiong, Chong Zhou, Xiaoyu Xiang, Lemeng Wu, Chenchen Zhu, Zechun Liu, Saksham Suri, Balakrishnan Varadarajan, Ramya Akula, Forrest Iandola, Raghuraman Krishnamoorthi, Bilge Soran, Vikas Chandra
cs.AI
Abstract
Segment Anything Model 2 (SAM 2) has emerged as a powerful tool for video
object segmentation and tracking anything. Key components of SAM 2 that drive
the impressive video object segmentation performance include a large multistage
image encoder for frame feature extraction and a memory mechanism that stores
memory contexts from past frames to help current frame segmentation. The high
computational complexity of the multistage image encoder and the memory module has
limited their application in real-world tasks, e.g., video object segmentation
on mobile devices. To address this limitation, we propose EfficientTAMs,
lightweight track anything models that produce high-quality results with low
latency and small model size. Our idea is to revisit the plain,
nonhierarchical Vision Transformer (ViT) as an image encoder for video object
segmentation and to introduce an efficient memory module, which reduces the
complexity for both frame feature extraction and memory computation for current
frame segmentation. We build EfficientTAMs from vanilla lightweight ViTs and the
efficient memory module, and train the models on the SA-1B and SA-V datasets
for video object segmentation and track anything tasks. We evaluate on multiple
video segmentation benchmarks including semi-supervised VOS and promptable
video segmentation, and find that our proposed EfficientTAM with a vanilla ViT
performs comparably to the SAM 2 model (HieraB+SAM 2) with ~2x speedup on A100 and
~2.4x parameter reduction. On segment anything image tasks, our EfficientTAMs
also perform favorably over the original SAM with ~20x speedup on A100 and ~20x
parameter reduction. On mobile devices such as iPhone 15 Pro Max, our
EfficientTAMs can run at ~10 FPS for performing video object segmentation with
reasonable quality, highlighting the capability of small models for on-device
video object segmentation applications.
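The abstract does not spell out how the efficient memory module lowers the cost of memory attention. As an illustration only, below is a minimal PyTorch sketch (not the authors' code) of one plausible design consistent with the description: cross-attention from current-frame tokens to memory tokens whose keys and values are first average-pooled spatially, so the attention cost shrinks with the pooling factor. The class name `PooledMemoryCrossAttention` and the parameters `pool` and `mem_hw` are hypothetical, introduced here for the sketch.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PooledMemoryCrossAttention(nn.Module):
    """Cross-attention to spatially pooled memory tokens (illustrative sketch)."""

    def __init__(self, dim: int, num_heads: int = 8, pool: int = 2):
        super().__init__()
        self.num_heads = num_heads
        self.pool = pool  # spatial pooling factor for memory keys/values
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, frame_tokens, memory_tokens, mem_hw):
        # frame_tokens:  (B, N, C) tokens of the current frame (queries)
        # memory_tokens: (B, T*H*W, C) spatial tokens from T past frames
        # mem_hw:        (H, W) spatial grid size of each stored memory frame
        B, N, C = frame_tokens.shape
        H, W = mem_hw
        T = memory_tokens.shape[1] // (H * W)

        # Average-pool the memory grid, exploiting local redundancy of
        # neighboring memory tokens: (B*T, C, H, W) -> coarser grid.
        m = memory_tokens.view(B * T, H, W, C).permute(0, 3, 1, 2)
        m = F.avg_pool2d(m, self.pool)
        m = m.flatten(2).transpose(1, 2).reshape(B, -1, C)  # (B, M', C)

        q = self.q(frame_tokens)
        k, v = self.kv(m).chunk(2, dim=-1)

        def heads(x):
            return x.view(B, -1, self.num_heads, C // self.num_heads).transpose(1, 2)

        # Attention now runs over the reduced memory token set M' = T*H*W/pool^2.
        out = F.scaled_dot_product_attention(heads(q), heads(k), heads(v))
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

# Hypothetical usage: 4 memory frames of 64x64 tokens at dim 256.
attn = PooledMemoryCrossAttention(dim=256, pool=2)
frame = torch.randn(1, 64 * 64, 256)
memory = torch.randn(1, 4 * 64 * 64, 256)
fused = attn(frame, memory, mem_hw=(64, 64))  # (1, 4096, 256)

With pool=2, the number of memory keys/values drops 4x, so the cross-attention FLOPs drop roughly 4x while the queries remain at full frame resolution; the pooling factor directly trades memory-attention cost against how much spatial detail of past frames is preserved.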