Number it: Temporal Grounding Videos like Flipping Manga
November 15, 2024
Authors: Yongliang Wu, Xinting Hu, Yuyang Sun, Yizhou Zhou, Wenbo Zhu, Fengyun Rao, Bernt Schiele, Xu Yang
cs.AI
Abstract
Video Large Language Models (Vid-LLMs) have made remarkable advancements in
comprehending video content for QA dialogue. However, they struggle to extend
this visual understanding to tasks requiring precise temporal localization,
known as Video Temporal Grounding (VTG). To address this gap, we introduce
Number-Prompt (NumPro), a novel method that empowers Vid-LLMs to bridge visual
comprehension with temporal grounding by adding unique numerical identifiers to
each video frame. Treating a video as a sequence of numbered frame images,
NumPro transforms VTG into an intuitive process: flipping through manga panels
in sequence. This allows Vid-LLMs to "read" event timelines, accurately linking
visual content with corresponding temporal information. Our experiments
demonstrate that NumPro significantly boosts VTG performance of top-tier
Vid-LLMs without additional computational cost. Furthermore, fine-tuning on a
NumPro-enhanced dataset defines a new state-of-the-art for VTG, surpassing
previous top-performing methods by up to 6.9% in mIoU for moment retrieval and
8.5% in mAP for highlight detection. The code will be available at
https://github.com/yongliang-wu/NumPro.
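
To make the core idea concrete, below is a minimal sketch (not the authors' implementation; see the linked repository for that) of how a unique numerical identifier might be stamped onto each video frame with Pillow. The font, size, color, and corner placement are illustrative assumptions, not the paper's tuned settings.

```python
# Hypothetical sketch of the NumPro idea: overlay each frame's index as a
# visible number before feeding the sequence to a Vid-LLM. Font, size,
# color, and position are assumptions for illustration only.
from PIL import Image, ImageDraw, ImageFont

def add_frame_numbers(frames: list[Image.Image]) -> list[Image.Image]:
    """Overlay each frame's index as a unique numerical identifier."""
    font = ImageFont.load_default()  # assumption: any legible font works
    numbered = []
    for idx, frame in enumerate(frames):
        stamped = frame.copy()  # keep the original frame untouched
        draw = ImageDraw.Draw(stamped)
        w, h = stamped.size
        # Place the number near the bottom-right corner (placement is a guess).
        draw.text((w - 40, h - 30), str(idx), fill="red", font=font)
        numbered.append(stamped)
    return numbered
```

With frames numbered this way, a Vid-LLM can be prompted to answer grounding queries in terms of frame numbers (e.g., the frames where an event starts and ends), which map back to timestamps through the video's sampling rate.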