ReferEverything: Towards Segmenting Everything We Can Speak of in Videos
October 30, 2024
Authors: Anurag Bagchi, Zhipeng Bao, Yu-Xiong Wang, Pavel Tokmakov, Martial Hebert
cs.AI
Abstract
We present REM, a framework for segmenting a wide range of concepts in video
that can be described through natural language. Our method capitalizes on
visual-language representations learned by video diffusion models on
Internet-scale datasets. A key insight of our approach is preserving as much of
the generative model's original representation as possible, while fine-tuning
it on narrow-domain Referral Object Segmentation datasets. As a result, our
framework can accurately segment and track rare and unseen objects, despite
being trained on object masks from a limited set of categories. Additionally,
it can generalize to non-object dynamic concepts, such as waves crashing in the
ocean, as demonstrated in our newly introduced benchmark for Referral Video
Process Segmentation (Ref-VPS). Our experiments show that REM performs on par
with state-of-the-art approaches on in-domain datasets, like Ref-DAVIS, while
outperforming them by up to twelve points in terms of region similarity on
out-of-domain data, leveraging the power of Internet-scale pre-training.Summary
AI-Generated Summary
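
To make the abstract's key insight concrete, below is a minimal, hypothetical sketch of the general recipe it describes: take a text-conditioned video diffusion backbone, keep its architecture (and as much of its pretrained representation as possible) intact, and fine-tune it end-to-end so its output is reinterpreted as a referral segmentation mask rather than generated video. All names here (ToyVideoDiffusionBackbone, ReferralSegmenter, mask_loss) and all shapes, losses, and hyperparameters are illustrative assumptions, not the authors' actual implementation.

```python
# Toy sketch of fine-tuning a (stand-in) video diffusion backbone for
# referral video segmentation. Not the REM code; shapes and losses are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyVideoDiffusionBackbone(nn.Module):
    """Stand-in for a pretrained, text-conditioned video diffusion U-Net.

    Maps a video (B, C, T, H, W) plus a text embedding (B, D) to a tensor
    with the same shape as the input video.
    """

    def __init__(self, in_ch=3, hid=32, text_dim=64):
        super().__init__()
        self.enc = nn.Conv3d(in_ch, hid, kernel_size=3, padding=1)
        self.text_proj = nn.Linear(text_dim, hid)
        self.dec = nn.Conv3d(hid, in_ch, kernel_size=3, padding=1)

    def forward(self, video, text_emb):
        h = F.relu(self.enc(video))
        # Inject the language condition as a per-channel bias (a toy stand-in
        # for the cross-attention used by real video diffusion models).
        h = h + self.text_proj(text_emb)[:, :, None, None, None]
        return self.dec(h)


class ReferralSegmenter(nn.Module):
    """Keeps the backbone architecture unchanged; only its output is
    reinterpreted as 1-channel mask logits (here by averaging channels)."""

    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone

    def forward(self, video, text_emb):
        out = self.backbone(video, text_emb)      # (B, C, T, H, W)
        return out.mean(dim=1, keepdim=True)      # (B, 1, T, H, W) mask logits


def mask_loss(logits, target):
    """BCE + soft Dice on per-frame masks (an assumed objective)."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum()
    dice = 1 - (2 * inter + 1) / (probs.sum() + target.sum() + 1)
    return bce + dice


if __name__ == "__main__":
    backbone = ToyVideoDiffusionBackbone()   # pretend this carries pretrained weights
    model = ReferralSegmenter(backbone)
    # Small learning rate: stay close to the original generative representation.
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

    video = torch.randn(2, 3, 8, 64, 64)     # (B, C, T, H, W) clip
    text_emb = torch.randn(2, 64)            # embedding of e.g. "the wave crashing on the shore"
    gt_mask = (torch.rand(2, 1, 8, 64, 64) > 0.5).float()

    loss = mask_loss(model(video, text_emb), gt_mask)
    loss.backward()
    opt.step()
    print(f"toy training step, loss={loss.item():.3f}")
```

The point of the sketch is the design choice the abstract emphasizes: no new task-specific decoder is bolted on and the backbone is perturbed as little as possible, so the Internet-scale visual-language representation learned during generative pretraining is what drives generalization to rare objects and non-object concepts.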