Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Abstract

This work presents Sa2VA, the first unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with LLaVA, an advanced vision-language model, and unifies text, image, and video into a shared LLM token space. Using the LLM, Sa2VA generates instruction tokens that guide SAM-2 in producing precise masks, enabling a grounded, multi-modal understanding of both static and dynamic visual content. Additionally, we introduce Ref-SAV, an auto-labeled dataset containing over 72k object expressions in complex video scenes, designed to boost model performance. We also manually validate 2k video objects in the Ref-SAV dataset to benchmark referring video object segmentation in complex environments. Experiments show that Sa2VA achieves state-of-the-art performance across multiple tasks, particularly in referring video object segmentation, highlighting its potential for complex real-world applications.
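
The data flow described in the abstract (the LLM reads text and frames, emits an instruction, and the segmentation model turns that instruction into masks) can be illustrated with a toy sketch. Everything here is a hypothetical stand-in: `toy_llm`, `toy_mask_decoder`, `segment`, and the thresholding rule are invented for illustration and are not the Sa2VA, SAM-2, or LLaVA APIs. The sketch only shows one instruction vector conditioning the mask of every frame.

```python
from dataclasses import dataclass
from typing import List

Frame = List[List[float]]   # toy grayscale frame: rows of pixel values in [0, 1]
Mask = List[List[int]]      # binary mask with the same shape as a frame


@dataclass
class LLMOutput:
    text: str                 # conversational answer
    instruction: List[float]  # stand-in for an instruction-token embedding


def toy_llm(prompt: str, frames: List[Frame]) -> LLMOutput:
    """Hypothetical stand-in for the LLM: consumes text and frames, returns
    an answer plus an instruction vector for the segmenter."""
    # Toy rule (an assumption): "bright" selects bright pixels, else dark ones.
    want_bright = "bright" in prompt.lower()
    pixels = [p for frame in frames for row in frame for p in row]
    threshold = sum(pixels) / len(pixels)  # mean intensity over all frames
    return LLMOutput(
        text="Sure, here is the mask.",
        instruction=[threshold, 1.0 if want_bright else -1.0],
    )


def toy_mask_decoder(frame: Frame, instruction: List[float]) -> Mask:
    """Hypothetical stand-in for the segmentation model: maps one frame and
    the instruction vector to a binary mask."""
    threshold, sign = instruction
    return [[1 if (p - threshold) * sign > 0 else 0 for p in row] for row in frame]


def segment(prompt: str, frames: List[Frame]) -> List[Mask]:
    """A single instruction from the LLM conditions the mask of every frame."""
    out = toy_llm(prompt, frames)
    return [toy_mask_decoder(frame, out.instruction) for frame in frames]
```

In this sketch an image is simply passed as a one-frame list, for example `segment("the bright region", [image])`. That is a simplification chosen here so that images and videos share one code path, not a statement about how Sa2VA handles them internally.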
