VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation
December 1, 2024
Authors: Weiming Ren, Huan Yang, Jie Min, Cong Wei, Wenhu Chen
cs.AI
Abstract
Current large multimodal models (LMMs) face significant challenges in
processing and comprehending long-duration or high-resolution videos, mainly
due to the lack of high-quality datasets. To address this issue from a
data-centric perspective, we propose VISTA, a simple yet effective Video
Spatiotemporal Augmentation framework that synthesizes long-duration and
high-resolution video instruction-following pairs from existing video-caption
datasets. VISTA spatially and temporally combines videos to create new
synthetic videos with extended durations and enhanced resolutions, and
subsequently produces question-answer pairs pertaining to these newly
synthesized videos. Based on this paradigm, we develop seven video augmentation
methods and curate VISTA-400K, a video instruction-following dataset aimed at
enhancing long-duration and high-resolution video understanding. Finetuning
various video LMMs on our data yields an average improvement of 3.3% across
four challenging long-video understanding benchmarks. Furthermore, we
introduce HRVideoBench, the first comprehensive high-resolution video
understanding benchmark, on which our finetuned models achieve a 6.5%
performance gain. These results highlight the effectiveness of our framework.
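
The abstract does not spell out how the spatiotemporal combination works, and the paper's seven augmentation methods are not reproduced here. As a rough illustration of the core idea only, the minimal NumPy sketch below shows the two basic operations the abstract describes: temporal concatenation to extend duration, and spatial tiling to raise resolution. The function names, the fixed 2x2 grid, and the array conventions are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np

def temporal_concat(clips):
    """Temporally combine clips (each shaped (T, H, W, C)) into one
    longer synthetic video. Assumes all clips share the same H, W, C."""
    return np.concatenate(clips, axis=0)

def spatial_grid(clips):
    """Spatially tile four equal-length, equal-size clips into a 2x2 grid,
    yielding a synthetic video with doubled height and width."""
    top = np.concatenate([clips[0], clips[1]], axis=2)     # join widths
    bottom = np.concatenate([clips[2], clips[3]], axis=2)  # join widths
    return np.concatenate([top, bottom], axis=1)           # stack rows

# Toy usage with random frames: two 8-frame 64x64 RGB clips become one
# 16-frame clip; four clips become one 8-frame 128x128 clip.
clips = [np.random.randint(0, 256, (8, 64, 64, 3), dtype=np.uint8)
         for _ in range(4)]
long_video = temporal_concat(clips[:2])  # shape (16, 64, 64, 3)
hires_video = spatial_grid(clips)        # shape (8, 128, 128, 3)
```

In the framework described by the abstract, question-answer pairs would then be generated about these synthetic videos (e.g., about a specific sub-clip or grid cell) from the source videos' captions; that generation step is not sketched here.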