
Efficient Track Anything

November 28, 2024
Authors: Yunyang Xiong, Chong Zhou, Xiaoyu Xiang, Lemeng Wu, Chenchen Zhu, Zechun Liu, Saksham Suri, Balakrishnan Varadarajan, Ramya Akula, Forrest Iandola, Raghuraman Krishnamoorthi, Bilge Soran, Vikas Chandra
cs.AI

Abstract

Segment Anything Model 2 (SAM 2) has emerged as a powerful tool for video object segmentation and tracking anything. Key components of SAM 2 that drive its impressive video object segmentation performance include a large multistage image encoder for frame feature extraction and a memory mechanism that stores memory contexts from past frames to help current frame segmentation. The high computational complexity of the multistage image encoder and the memory module limits its applications in real-world tasks, e.g., video object segmentation on mobile devices. To address this limitation, we propose EfficientTAMs, lightweight track-anything models that produce high-quality results with low latency and small model size. Our idea is based on revisiting the plain, nonhierarchical Vision Transformer (ViT) as an image encoder for video object segmentation, and introducing an efficient memory module, which reduces the complexity of both frame feature extraction and memory computation for current frame segmentation. We use vanilla lightweight ViTs and the efficient memory module to build EfficientTAMs, and train the models on the SA-1B and SA-V datasets for video object segmentation and track-anything tasks. We evaluate on multiple video segmentation benchmarks, including semi-supervised VOS and promptable video segmentation, and find that our proposed EfficientTAM with a vanilla ViT performs comparably to the SAM 2 model (HieraB+SAM 2) with ~2x speedup on A100 and ~2.4x parameter reduction. On segment-anything image tasks, our EfficientTAMs also perform favorably over the original SAM with ~20x speedup on A100 and ~20x parameter reduction. On mobile devices such as iPhone 15 Pro Max, our EfficientTAMs can run at ~10 FPS while performing video object segmentation with reasonable quality, highlighting the capability of small models for on-device video object segmentation applications.
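The abstract's core efficiency argument is that cross-attention between current-frame queries and a large bank of memory tokens dominates the memory module's cost, and that shrinking the memory-token set makes this cost tractable. A minimal numpy sketch of that idea is below; the pooling window, tensor sizes, and function names are illustrative assumptions, not the paper's actual architecture or code.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    # q: (Nq, d) current-frame queries; k, v: (Nm, d) memory tokens.
    # Cost scales as O(Nq * Nm * d), so reducing Nm reduces compute.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def pool_memory(tokens, window):
    # Average-pool memory tokens in non-overlapping windows to shrink Nm.
    # (A stand-in for a coarser memory representation; hypothetical.)
    n, d = tokens.shape
    n_trim = (n // window) * window
    return tokens[:n_trim].reshape(-1, window, d).mean(axis=1)

rng = np.random.default_rng(0)
d = 64
q = rng.standard_normal((256, d))      # queries for the current frame
mem = rng.standard_normal((4096, d))   # memory tokens from past frames

full = cross_attention(q, mem, mem)        # full-cost memory attention
coarse = pool_memory(mem, 16)              # 4096 -> 256 memory tokens
fast = cross_attention(q, coarse, coarse)  # ~16x fewer memory tokens
```

Both paths produce one output vector per query; the pooled path simply attends over a 16x smaller memory bank, which is the kind of complexity reduction the efficient memory module targets.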

