Number it: Temporal Grounding Videos like Flipping Manga

November 15, 2024
Authors: Yongliang Wu, Xinting Hu, Yuyang Sun, Yizhou Zhou, Wenbo Zhu, Fengyun Rao, Bernt Schiele, Xu Yang
cs.AI

Abstract

Video Large Language Models (Vid-LLMs) have made remarkable advancements in comprehending video content for QA dialogue. However, they struggle to extend this visual understanding to tasks requiring precise temporal localization, known as Video Temporal Grounding (VTG). To address this gap, we introduce Number-Prompt (NumPro), a novel method that empowers Vid-LLMs to bridge visual comprehension with temporal grounding by adding unique numerical identifiers to each video frame. Treating a video as a sequence of numbered frame images, NumPro transforms VTG into an intuitive process: flipping through manga panels in sequence. This allows Vid-LLMs to "read" event timelines, accurately linking visual content with corresponding temporal information. Our experiments demonstrate that NumPro significantly boosts VTG performance of top-tier Vid-LLMs without additional computational cost. Furthermore, fine-tuning on a NumPro-enhanced dataset defines a new state-of-the-art for VTG, surpassing previous top-performing methods by up to 6.9% in mIoU for moment retrieval and 8.5% in mAP for highlight detection. The code will be available at https://github.com/yongliang-wu/NumPro.
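
As a rough illustration of the core idea in the abstract, the sketch below stamps each sampled frame with its index before the frames are passed to a Vid-LLM. The overlay position, color, and the `frames_to_seconds` helper are illustrative assumptions, not the paper's reported settings; see the NumPro repository for the actual implementation.

```python
from PIL import Image, ImageDraw

def number_frames(frames: list[Image.Image]) -> list[Image.Image]:
    """Overlay a unique numerical identifier on each sampled frame.

    Takes a list of PIL Images and returns annotated copies.
    """
    numbered = []
    for idx, frame in enumerate(frames):
        frame = frame.copy()
        draw = ImageDraw.Draw(frame)
        w, h = frame.size
        # Stamp the frame index in the bottom-right corner. The position,
        # color, and default font here are illustrative choices, not the
        # paper's exact configuration.
        draw.text((w - 60, h - 40), str(idx), fill="red")
        numbered.append(frame)
    return numbered

def frames_to_seconds(start_idx: int, end_idx: int, fps: float = 1.0):
    # Map frame indices named by the model (e.g. "frames 12 to 25")
    # back to timestamps via the frame sampling rate (assumed here).
    return start_idx / fps, end_idx / fps
```

Prompted over the numbered frames, the model can answer a grounding query with frame numbers (e.g. "the event spans frames 12 to 25"), which map back to start and end timestamps through the sampling rate.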
