
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

January 7, 2025
Authors: Haobo Yuan, Xiangtai Li, Tao Zhang, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, Ming-Hsuan Yang
cs.AI

Abstract

This work presents Sa2VA, the first unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with LLaVA, an advanced vision-language model, and unifies text, image, and video into a shared LLM token space. Using the LLM, Sa2VA generates instruction tokens that guide SAM-2 in producing precise masks, enabling a grounded, multi-modal understanding of both static and dynamic visual content. Additionally, we introduce Ref-SAV, an auto-labeled dataset containing over 72k object expressions in complex video scenes, designed to boost model performance. We also manually validate 2k video objects in the Ref-SAV datasets to benchmark referring video object segmentation in complex environments. Experiments show that Sa2VA achieves state-of-the-art across multiple tasks, particularly in referring video object segmentation, highlighting its potential for complex real-world applications.
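The abstract describes the core interface only at a high level: the LLM emits instruction tokens whose embeddings prompt SAM-2 to produce masks. The snippet below is a minimal, non-authoritative sketch of that token-to-mask flow under stated assumptions; all module and parameter names (`Sa2VASketch`, `hidden_to_prompt`, `seg_token_id`, the dimensions) are illustrative placeholders, not the authors' implementation.

```python
# Hypothetical sketch of Sa2VA's LLM-to-SAM-2 mask prompting, based only on the abstract.
import torch
import torch.nn as nn


class Sa2VASketch(nn.Module):
    def __init__(self, llava, sam2, hidden_dim=4096, prompt_dim=256, seg_token_id=32000):
        super().__init__()
        self.llava = llava            # vision-language model: (frames, text) -> LLM hidden states
        self.sam2 = sam2              # video segmentation model: (frames, prompt) -> masks
        self.seg_token_id = seg_token_id
        # Project the hidden state of the special segmentation token
        # into SAM-2's prompt-embedding space (assumed linear projection).
        self.hidden_to_prompt = nn.Linear(hidden_dim, prompt_dim)

    def forward(self, frames, input_ids):
        # 1. Encode video frames and text in a shared LLM token space.
        hidden_states = self.llava(frames, input_ids)          # (B, T, hidden_dim)
        # 2. Locate the instruction token(s) emitted by the LLM.
        seg_positions = (input_ids == self.seg_token_id)
        seg_hidden = hidden_states[seg_positions]               # (num_seg, hidden_dim)
        # 3. Turn them into prompts that guide SAM-2's mask decoder.
        prompt = self.hidden_to_prompt(seg_hidden)              # (num_seg, prompt_dim)
        # 4. SAM-2 produces masks for the referred objects and, for video,
        #    propagates them across frames.
        masks = self.sam2(frames, prompt)
        return masks
```

Read as a design sketch, the key point is that segmentation is driven entirely through the shared token space: the only bridge between the vision-language model and SAM-2 is the projected embedding of the instruction token.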
