
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation

December 1, 2024
Authors: Weiming Ren, Huan Yang, Jie Min, Cong Wei, Wenhu Chen
cs.AI

Abstract

Current large multimodal models (LMMs) face significant challenges in processing and comprehending long-duration or high-resolution videos, mainly due to the lack of high-quality datasets. To address this issue from a data-centric perspective, we propose VISTA, a simple yet effective Video Spatiotemporal Augmentation framework that synthesizes long-duration and high-resolution video instruction-following pairs from existing video-caption datasets. VISTA spatially and temporally combines videos to create new synthetic videos with extended durations and enhanced resolutions, and subsequently produces question-answer pairs pertaining to these newly synthesized videos. Based on this paradigm, we develop seven video augmentation methods and curate VISTA-400K, a video instruction-following dataset aimed at enhancing long-duration and high-resolution video understanding. Finetuning various video LMMs on our data yields an average improvement of 3.3% across four challenging benchmarks for long-video understanding. Furthermore, we introduce HRVideoBench, the first comprehensive high-resolution video understanding benchmark, on which our finetuned models achieve a 6.5% performance gain. These results highlight the effectiveness of our framework.
