SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree

Abstract

The Segment Anything Model 2 (SAM 2) has emerged as a powerful foundation model for object segmentation in both images and videos, paving the way for various downstream video applications. The crucial design of SAM 2 for video segmentation is its memory module, which draws object-aware memories from previous frames to inform the prediction for the current frame. However, its greedy-selection memory design suffers from the "error accumulation" problem: an erroneous or missed mask cascades into the segmentation of subsequent frames, which limits the performance of SAM 2 on complex long-term videos. To address this, we introduce SAM2Long, an improved training-free video object segmentation strategy that accounts for the segmentation uncertainty within each frame and selects the video-level optimal result from multiple segmentation pathways through a constrained tree search. In practice, we maintain a fixed number of segmentation pathways throughout the video. For each frame, multiple masks are proposed based on the existing pathways, creating a set of candidate branches. We then keep the same fixed number of branches, those with the highest cumulative scores, as the pathways for the next frame. After processing the final frame, the pathway with the highest cumulative score is chosen as the final segmentation result. Thanks to this heuristic search design, SAM2Long is robust to occlusions and object reappearances, and can effectively segment and track objects in complex long-term videos. Notably, SAM2Long achieves an average improvement of 3.0 points across all 24 head-to-head comparisons, with gains of up to 5.3 points in J&F on long-term video object segmentation benchmarks such as SA-V and LVOS. The code is released at https://github.com/Mark12Ding/SAM2Long.
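The pathway selection described in the abstract can be illustrated with a minimal sketch. This is not the released SAM2Long implementation: the names `Pathway`, `constrained_tree_search`, and `toy_propose` are hypothetical, masks are replaced by string labels, and the per-mask scores are made-up numbers standing in for whatever confidence the segmentation model would provide. The sketch only shows the control flow: expand every pathway with several candidates, keep a fixed number of branches by cumulative score, and return the best pathway after the last frame.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical stand-in: (mask label, confidence score). A real system
# would carry mask tensors and model-predicted scores instead.
Candidate = Tuple[str, float]


@dataclass
class Pathway:
    masks: List[str] = field(default_factory=list)  # one mask per processed frame
    score: float = 0.0                              # cumulative score so far


def constrained_tree_search(
    num_frames: int,
    propose: Callable[[int, Pathway], List[Candidate]],
    num_pathways: int,
) -> Pathway:
    """Keep `num_pathways` pathways per frame and return the best one at the end."""
    pathways = [Pathway()]
    for frame in range(num_frames):
        # Every surviving pathway proposes several masks, giving candidate branches.
        branches = [
            Pathway(p.masks + [mask], p.score + s)
            for p in pathways
            for mask, s in propose(frame, p)
        ]
        # Retain only the branches with the highest cumulative scores.
        branches.sort(key=lambda b: b.score, reverse=True)
        pathways = branches[:num_pathways]
    return max(pathways, key=lambda p: p.score)


def toy_propose(frame: int, pathway: Pathway) -> List[Candidate]:
    """Made-up proposer: 'A' looks best on frame 0 but scores poorly afterwards."""
    if frame == 0:
        return [("A0", 0.9), ("B0", 0.8)]
    if pathway.masks[-1].startswith("A"):
        return [(f"A{frame}", 0.1), (f"A{frame}'", 0.05)]
    return [(f"B{frame}", 0.9), (f"B{frame}'", 0.3)]


greedy = constrained_tree_search(3, toy_propose, num_pathways=1)
search = constrained_tree_search(3, toy_propose, num_pathways=2)
print(greedy.masks)
print(search.masks)
```

With a single pathway the search is greedy and stays locked onto the `A` masks chosen at frame 0, which mirrors the error accumulation problem. With two pathways, the slightly lower-scoring `B0` branch survives the first frame and ends up with the highest cumulative score, so the returned pathway follows the `B` masks.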
