
DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency

April 16, 2025
Authors: Mengshi Qi, Pengfei Zhu, Xiangtai Li, Xiaoyang Bi, Lu Qi, Huadong Ma, Ming-Hsuan Yang
cs.AI

Abstract

Given a single labeled example, in-context segmentation aims to segment corresponding objects in new images. This setting, known as one-shot segmentation in few-shot learning, probes a segmentation model's generalization ability and has been applied to various vision tasks, including scene understanding and image/video editing. While recent Segment Anything Models (SAM) have achieved state-of-the-art results in interactive segmentation, these approaches are not directly applicable to in-context segmentation. In this work, we propose Dual Consistency SAM (DC-SAM), a prompt-tuning-based method that adapts SAM and SAM2 for in-context segmentation of both images and videos. Our key insight is to enhance the features of SAM's prompt encoder by providing high-quality visual prompts. When generating a mask prior, we fuse SAM features to better align the prompt encoder. We then design a cycle-consistent cross-attention mechanism over the fused features and the initial visual prompts. Next, we introduce a dual-branch design that uses discriminative positive and negative prompts in the prompt encoder. Furthermore, we devise a simple mask-tube training strategy to apply the proposed dual-consistency method to mask tubes. Although DC-SAM is primarily designed for images, it extends seamlessly to the video domain with the support of SAM2. Given the absence of in-context segmentation benchmarks in the video domain, we manually curate and construct the first such benchmark from existing video segmentation datasets, named In-Context Video Object Segmentation (IC-VOS), to better assess a model's in-context capability. Extensive experiments demonstrate that our method achieves 55.5 (+1.4) mIoU on COCO-20^i, 73.0 (+1.1) mIoU on PASCAL-5^i, and a J&F score of 71.52 on the proposed IC-VOS benchmark. Our source code and benchmark are available at https://github.com/zaplm/DC-SAM.
