VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
December 31, 2024
Authors: Yuqian Yuan, Hang Zhang, Wentong Li, Zesen Cheng, Boqiang Zhang, Long Li, Xin Li, Deli Zhao, Wenqiao Zhang, Yueting Zhuang, Jianke Zhu, Lidong Bing
cs.AI
Abstract
Video Large Language Models (Video LLMs) have recently exhibited remarkable
capabilities in general video understanding. However, they mainly focus on
holistic comprehension and struggle with capturing fine-grained spatial and
temporal details. Besides, the lack of high-quality object-level video
instruction data and a comprehensive benchmark further hinders their
advancements. To tackle these challenges, we introduce the VideoRefer Suite to
empower Video LLM for finer-level spatial-temporal video understanding, i.e.,
enabling perception and reasoning on any objects throughout the video.
Specifically, we develop the VideoRefer Suite across three essential
aspects: dataset, model, and benchmark. Firstly, we introduce a multi-agent
data engine to meticulously curate a large-scale, high-quality object-level
video instruction dataset, termed VideoRefer-700K. Next, we present the
VideoRefer model, which is equipped with a versatile spatial-temporal object encoder to
capture precise regional and sequential representations. Finally, we
create VideoRefer-Bench to comprehensively assess the spatial-temporal
understanding capability of Video LLMs across various aspects. Extensive
experiments and analyses demonstrate that our
VideoRefer model not only achieves promising performance on video referring
benchmarks but also enhances general video understanding capabilities.
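To make the idea of a "spatial-temporal object encoder" concrete, here is a minimal illustrative sketch, not the paper's actual architecture: given per-frame patch features and binary object masks, masked average pooling yields a regional token per frame, and averaging those tokens over time yields a single object-level embedding. The function name `encode_object` and all shapes are assumptions for illustration.

```python
import numpy as np

def encode_object(frame_feats: np.ndarray, masks: np.ndarray) -> np.ndarray:
    """Hypothetical sketch of a spatial-temporal object encoder.

    frame_feats: (T, P, D) patch features for T frames, P patches, D dims
    masks:       (T, P)    binary masks marking the object's patches per frame
    Returns a single (D,) object token.
    """
    # Normalize each frame's mask into pooling weights; frames where the
    # object is absent (all-zero mask) contribute a zero token.
    w = masks / np.clip(masks.sum(axis=1, keepdims=True), 1, None)
    # Regional representation: masked average pooling within each frame.
    per_frame = (w[..., None] * frame_feats).sum(axis=1)  # (T, D)
    # Sequential representation: aggregate across time (here, a plain mean).
    return per_frame.mean(axis=0)  # (D,)
```

A real encoder would likely replace the temporal mean with learned attention so the model can weight frames adaptively, but the masked-pool-then-aggregate structure is the core pattern.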