ReferEverything: Towards Segmenting Everything We Can Speak of in Videos
October 30, 2024
Authors: Anurag Bagchi, Zhipeng Bao, Yu-Xiong Wang, Pavel Tokmakov, Martial Hebert
cs.AI
Abstract
We present REM, a framework for segmenting a wide range of concepts in video
that can be described through natural language. Our method capitalizes on
visual-language representations learned by video diffusion models on
Internet-scale datasets. A key insight of our approach is preserving as much of
the generative model's original representation as possible, while fine-tuning
it on narrow-domain Referral Object Segmentation datasets. As a result, our
framework can accurately segment and track rare and unseen objects, despite
being trained on object masks from a limited set of categories. Additionally,
it can generalize to non-object dynamic concepts, such as waves crashing in the
ocean, as demonstrated in our newly introduced benchmark for Referral Video
Process Segmentation (Ref-VPS). Our experiments show that REM performs on par
with state-of-the-art approaches on in-domain datasets, like Ref-DAVIS, while
outperforming them by up to twelve points in terms of region similarity on
out-of-domain data, leveraging the power of Internet-scale pre-training.
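The abstract's key insight is to reuse the visual-language representation of a pretrained video diffusion model and change it as little as possible while fine-tuning on narrow-domain referral segmentation data. The following sketch is a conceptual illustration of that idea, not the authors' implementation: `VideoDiffusionBackbone` is a placeholder for a pretrained text-conditioned video feature extractor, and the layer sizes, learning rates, and class names are assumptions.

```python
# Conceptual sketch (NOT the paper's architecture): keep a pretrained
# video-diffusion backbone nearly intact and learn only a lightweight mask head.
import torch
import torch.nn as nn

class ReferralSegmenter(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int):
        super().__init__()
        self.backbone = backbone  # pretrained visual-language representation (placeholder)
        self.mask_head = nn.Conv2d(feat_dim, 1, kernel_size=1)  # small, newly added head

    def forward(self, video: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # Backbone is assumed to return per-frame features of shape (B, T, C, H, W),
        # already conditioned on the language expression.
        feats = self.backbone(video, text_tokens)
        b, t, c, h, w = feats.shape
        logits = self.mask_head(feats.flatten(0, 1))  # (B*T, 1, H, W) mask logits
        return logits.view(b, t, 1, h, w)

def make_optimizer(model: ReferralSegmenter) -> torch.optim.Optimizer:
    # One way to "preserve the original representation": fine-tune the backbone
    # with a much smaller learning rate than the new head (illustrative values).
    return torch.optim.AdamW([
        {"params": model.backbone.parameters(), "lr": 1e-6},
        {"params": model.mask_head.parameters(), "lr": 1e-4},
    ])
```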
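The "region similarity" figure quoted above refers to the standard Jaccard (J) metric used on benchmarks such as Ref-DAVIS: the intersection-over-union between predicted and ground-truth masks, averaged over annotated frames. A minimal sketch of that computation follows; function names are illustrative and not taken from any evaluation toolkit.

```python
# Minimal sketch of region similarity (Jaccard index, "J") between binary masks.
import numpy as np

def region_similarity(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """IoU between a predicted and a ground-truth binary mask."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else float(intersection) / float(union)

def video_region_similarity(pred_masks, gt_masks) -> float:
    """Average J over the annotated frames of one video."""
    return float(np.mean([region_similarity(p, g) for p, g in zip(pred_masks, gt_masks)]))
```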