VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM

Abstract

Video Large Language Models (Video LLMs) have recently exhibited remarkable capabilities in general video understanding. However, they mainly focus on holistic comprehension and struggle to capture fine-grained spatial and temporal details. In addition, the lack of high-quality object-level video instruction data and of a comprehensive benchmark further hinders their advancement. To tackle these challenges, we introduce the VideoRefer Suite to empower Video LLMs for finer-level spatial-temporal video understanding, i.e., enabling perception and reasoning on any object throughout the video. Specifically, we develop the VideoRefer Suite across three essential aspects: dataset, model, and benchmark. First, we introduce a multi-agent data engine to curate a large-scale, high-quality object-level video instruction dataset, termed VideoRefer-700K. Next, we present the VideoRefer model, which is equipped with a versatile spatial-temporal object encoder to capture precise regional and sequential representations. Finally, we create VideoRefer-Bench to comprehensively assess the spatial-temporal understanding capability of a Video LLM across various aspects. Extensive experiments and analyses demonstrate that our VideoRefer model not only achieves promising performance on video referring benchmarks but also improves general video understanding capabilities.
