SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree
October 21, 2024
Authors: Shuangrui Ding, Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Yuwei Guo, Dahua Lin, Jiaqi Wang
cs.AI
Abstract
The Segment Anything Model 2 (SAM 2) has emerged as a powerful foundation
model for object segmentation in both images and videos, paving the way for
various downstream video applications. The crucial design of SAM 2 for video
segmentation is its memory module, which prompts object-aware memories from
previous frames for current frame prediction. However, its greedy-selection
memory design suffers from the "error accumulation" problem, where an erroneous
or missed mask cascades into the segmentation of subsequent
frames, which limits the performance of SAM 2 on complex long-term videos.
To this end, we introduce SAM2Long, an improved training-free video object
segmentation strategy, which considers the segmentation uncertainty within each
frame and chooses the video-level optimal results from multiple segmentation
pathways in a constrained tree search manner. In practice, we maintain a fixed
number of segmentation pathways throughout the video. For each frame, multiple
masks are proposed based on the existing pathways, creating various candidate
branches. We then select the same fixed number of branches with the highest
cumulative scores as the new pathways for the next frame. After processing the
final frame, the pathway with the highest cumulative score is chosen as the
final segmentation result. Benefiting from its heuristic search design,
SAM2Long is robust to occlusions and object reappearances, and can
effectively segment and track objects in complex long-term videos. Notably,
SAM2Long achieves an average improvement of 3.0 points across all 24
head-to-head comparisons, with gains of up to 5.3 points in J&F on long-term
video object segmentation benchmarks such as SA-V and LVOS. The code is
released at https://github.com/Mark12Ding/SAM2Long.
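
The constrained tree search described above can be illustrated with a minimal sketch. This is not the released SAM2Long code: the function name `constrained_tree_search`, the `propose` callback, the string stand-ins for masks, and the additive per-frame scores are all assumptions made for illustration. It only shows the general idea of keeping a fixed number of pathways, branching each one per frame, and pruning by cumulative score.

```python
from typing import Callable, List, Tuple

Mask = str                           # hypothetical stand-in for a real mask
Pathway = Tuple[List[Mask], float]   # (masks chosen so far, cumulative score)


def constrained_tree_search(
    num_frames: int,
    propose: Callable[[int, List[Mask]], List[Tuple[Mask, float]]],
    num_pathways: int = 3,
) -> Pathway:
    """Keep `num_pathways` candidate pathways and return the best one.

    `propose(t, history)` is an assumed callback that returns candidate
    (mask, score) pairs for frame `t`, given the masks already chosen
    on one pathway.
    """
    pathways: List[Pathway] = [([], 0.0)]
    for t in range(num_frames):
        branches: List[Pathway] = []
        # Every existing pathway proposes several masks for this frame.
        for masks, score in pathways:
            for mask, mask_score in propose(t, masks):
                branches.append((masks + [mask], score + mask_score))
        # Keep only the branches with the highest cumulative scores.
        branches.sort(key=lambda b: b[1], reverse=True)
        pathways = branches[:num_pathways]
    # After the final frame, the best cumulative score wins.
    return max(pathways, key=lambda p: p[1])
```

With `num_pathways=1` the sketch reduces to greedy selection, the behavior the abstract identifies as prone to error accumulation. A larger value lets a pathway that scores lower on an early frame survive and win later.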