ReferEverything: Towards Segmenting Everything We Can Speak of in Videos
October 30, 2024
Authors: Anurag Bagchi, Zhipeng Bao, Yu-Xiong Wang, Pavel Tokmakov, Martial Hebert
cs.AI
Abstract
We present REM, a framework for segmenting a wide range of concepts in video
that can be described through natural language. Our method capitalizes on
visual-language representations learned by video diffusion models on
Internet-scale datasets. A key insight of our approach is preserving as much of
the generative model's original representation as possible, while fine-tuning
it on narrow-domain Referral Object Segmentation datasets. As a result, our
framework can accurately segment and track rare and unseen objects, despite
being trained on object masks from a limited set of categories. Additionally,
it can generalize to non-object dynamic concepts, such as waves crashing in the
ocean, as demonstrated in our newly introduced benchmark for Referral Video
Process Segmentation (Ref-VPS). Our experiments show that REM performs on par
with state-of-the-art approaches on in-domain datasets, like Ref-DAVIS, while
outperforming them by up to twelve points in terms of region similarity on
out-of-domain data, leveraging the power of Internet-scale pre-training.Summary
AI-Generated Summary
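
To make the abstract's key insight concrete, below is a minimal, hypothetical sketch of the general recipe it describes: take a text-conditioned video diffusion backbone, keep its architecture (and as much of its pretrained representation as possible) intact, and fine-tune it end-to-end so its output is reinterpreted as a referral segmentation mask rather than generated video. All names here (ToyVideoDiffusionBackbone, ReferralSegmenter, mask_loss) and all shapes, losses, and hyperparameters are illustrative assumptions, not the authors' actual implementation.

```python
# Toy sketch of fine-tuning a (stand-in) video diffusion backbone for
# referral video segmentation. Not the REM code; shapes and losses are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyVideoDiffusionBackbone(nn.Module):
    """Stand-in for a pretrained, text-conditioned video diffusion U-Net.

    Maps a video (B, C, T, H, W) plus a text embedding (B, D) to a tensor
    with the same shape as the input video.
    """

    def __init__(self, in_ch=3, hid=32, text_dim=64):
        super().__init__()
        self.enc = nn.Conv3d(in_ch, hid, kernel_size=3, padding=1)
        self.text_proj = nn.Linear(text_dim, hid)
        self.dec = nn.Conv3d(hid, in_ch, kernel_size=3, padding=1)

    def forward(self, video, text_emb):
        h = F.relu(self.enc(video))
        # Inject the language condition as a per-channel bias (a toy stand-in
        # for the cross-attention used by real video diffusion models).
        h = h + self.text_proj(text_emb)[:, :, None, None, None]
        return self.dec(h)


class ReferralSegmenter(nn.Module):
    """Keeps the backbone architecture unchanged; only its output is
    reinterpreted as 1-channel mask logits (here by averaging channels)."""

    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone

    def forward(self, video, text_emb):
        out = self.backbone(video, text_emb)      # (B, C, T, H, W)
        return out.mean(dim=1, keepdim=True)      # (B, 1, T, H, W) mask logits


def mask_loss(logits, target):
    """BCE + soft Dice on per-frame masks (an assumed objective)."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum()
    dice = 1 - (2 * inter + 1) / (probs.sum() + target.sum() + 1)
    return bce + dice


if __name__ == "__main__":
    backbone = ToyVideoDiffusionBackbone()   # pretend this carries pretrained weights
    model = ReferralSegmenter(backbone)
    # Small learning rate: stay close to the original generative representation.
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

    video = torch.randn(2, 3, 8, 64, 64)     # (B, C, T, H, W) clip
    text_emb = torch.randn(2, 64)            # embedding of e.g. "the wave crashing on the shore"
    gt_mask = (torch.rand(2, 1, 8, 64, 64) > 0.5).float()

    loss = mask_loss(model(video, text_emb), gt_mask)
    loss.backward()
    opt.step()
    print(f"toy training step, loss={loss.item():.3f}")
```

The point of the sketch is the design choice the abstract emphasizes: no new task-specific decoder is bolted on and the backbone is perturbed as little as possible, so the Internet-scale visual-language representation learned during generative pretraining is what drives generalization to rare objects and non-object concepts.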