
One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos

September 29, 2024
Authors: Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Lei Liu, Zheng Zhang, Mike Zheng Shou
cs.AI

Abstract

We introduce VideoLISA, a video-based multimodal large language model designed to tackle the problem of language-instructed reasoning segmentation in videos. Leveraging the reasoning capabilities and world knowledge of large language models, and augmented by the Segment Anything Model, VideoLISA generates temporally consistent segmentation masks in videos based on language instructions. Existing image-based methods, such as LISA, struggle with video tasks due to the additional temporal dimension, which requires temporal dynamic understanding and consistent segmentation across frames. VideoLISA addresses these challenges by integrating a Sparse Dense Sampling strategy into the video-LLM, which balances temporal context and spatial detail within computational constraints. Additionally, we propose a One-Token-Seg-All approach using a specially designed <TRK> token, enabling the model to segment and track objects across multiple frames. Extensive evaluations on diverse benchmarks, including our newly introduced ReasonVOS benchmark, demonstrate VideoLISA's superior performance in video object segmentation tasks involving complex reasoning, temporal understanding, and object tracking. While optimized for videos, VideoLISA also shows promising generalization to image segmentation, revealing its potential as a unified foundation model for language-instructed object segmentation. Code and model will be available at: https://github.com/showlab/VideoLISA.
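
The abstract does not spell out how Sparse Dense Sampling works, so the following is only a minimal sketch of one plausible reading: a few uniformly spaced "dense" frames keep all of their visual tokens, while the remaining "sparse" frames are pooled down to a handful of tokens each. The function name `sparse_dense_plan` and the parameters `tokens_per_frame` and `pool_factor` are hypothetical and do not come from the VideoLISA code.

```python
def sparse_dense_plan(num_frames, num_dense, tokens_per_frame, pool_factor):
    """Return (dense_indices, token_counts) for a hypothetical sparse/dense split.

    Assumption (not from the paper): dense frames keep every visual token,
    sparse frames keep tokens_per_frame // pool_factor tokens (at least 1).
    """
    if not 0 < num_dense <= num_frames:
        raise ValueError("num_dense must be in 1..num_frames")
    if pool_factor < 1:
        raise ValueError("pool_factor must be >= 1")

    # Pick dense frames at (approximately) uniform spacing over the clip.
    if num_dense == 1:
        dense_indices = [num_frames // 2]
    else:
        dense_indices = [
            (i * (num_frames - 1)) // (num_dense - 1) for i in range(num_dense)
        ]

    dense_set = set(dense_indices)
    sparse_tokens = max(1, tokens_per_frame // pool_factor)

    # Per-frame token budget: full resolution for dense, pooled for sparse.
    token_counts = [
        tokens_per_frame if t in dense_set else sparse_tokens
        for t in range(num_frames)
    ]
    return dense_indices, token_counts


# Example: 32 frames, 4 of them dense, 256 tokens per frame, 64x pooling.
dense, counts = sparse_dense_plan(32, 4, 256, 64)
total_tokens = sum(counts)   # tokens under the sparse/dense split
all_dense_tokens = 32 * 256  # tokens if every frame were kept dense
```

Under these assumed numbers the split feeds far fewer visual tokens to the language model than keeping every frame at full resolution, while still covering all 32 frames in time. That is the trade-off between temporal context and spatial detail the abstract describes.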
