Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

January 7, 2025
Authors: Haobo Yuan, Xiangtai Li, Tao Zhang, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, Ming-Hsuan Yang
cs.AI

Abstract

This work presents Sa2VA, the first unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with LLaVA, an advanced vision-language model, and unifies text, image, and video into a shared LLM token space. Using the LLM, Sa2VA generates instruction tokens that guide SAM-2 in producing precise masks, enabling a grounded, multi-modal understanding of both static and dynamic visual content. Additionally, we introduce Ref-SAV, an auto-labeled dataset containing over 72k object expressions in complex video scenes, designed to boost model performance. We also manually validate 2k video objects in the Ref-SAV datasets to benchmark referring video object segmentation in complex environments. Experiments show that Sa2VA achieves state-of-the-art across multiple tasks, particularly in referring video object segmentation, highlighting its potential for complex real-world applications.
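
The abstract describes a coupling in which the LLM emits instruction tokens that condition SAM-2's mask prediction over a shared token space. The following is a minimal, runnable sketch of that idea only; the class, shapes, and the use of a special "[SEG]"-style token position are illustrative assumptions, not the authors' actual implementation or API.

```python
# Hypothetical sketch: an LLM hidden state at a segmentation instruction token is
# projected into a prompt embedding that conditions per-frame mask prediction.
import torch
import torch.nn as nn


class Sa2VASketch(nn.Module):
    def __init__(self, hidden_dim: int = 256, prompt_dim: int = 64):
        super().__init__()
        # Stand-in for the LLaVA-style LLM over a shared text/image/video token space.
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=2)
        # Projects the instruction token's hidden state into a SAM-2-style prompt space.
        self.seg_proj = nn.Linear(hidden_dim, prompt_dim)
        # Stand-in for the mask decoder: fuses frame features with the prompt embedding.
        self.mask_head = nn.Sequential(
            nn.Linear(prompt_dim * 2, prompt_dim), nn.ReLU(), nn.Linear(prompt_dim, 1)
        )

    def forward(self, tokens: torch.Tensor, seg_index: int, frame_feats: torch.Tensor) -> torch.Tensor:
        """
        tokens:      (B, L, hidden_dim) shared text/image/video token embeddings
        seg_index:   position of the segmentation instruction token in the sequence
        frame_feats: (B, T, H, W, prompt_dim) per-frame features from the video encoder
        returns:     (B, T, H, W) mask logits for the referred object
        """
        hidden = self.llm(tokens)                 # contextualized token states
        seg_hidden = hidden[:, seg_index, :]      # instruction token for the mask decoder
        prompt = self.seg_proj(seg_hidden)        # (B, prompt_dim)
        B, T, H, W, C = frame_feats.shape
        prompt_map = prompt[:, None, None, None, :].expand(B, T, H, W, C)
        fused = torch.cat([frame_feats, prompt_map], dim=-1)
        return self.mask_head(fused).squeeze(-1)


# Tiny shape check with random inputs.
model = Sa2VASketch()
tokens = torch.randn(1, 16, 256)
frames = torch.randn(1, 4, 8, 8, 64)
print(model(tokens, seg_index=15, frame_feats=frames).shape)  # torch.Size([1, 4, 8, 8])
```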
