VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format
November 27, 2024
作者: Yueqian Wang, Xiaojun Meng, Yuxuan Wang, Jianxin Liang, Jiansheng Wei, Huishuai Zhang, Dongyan Zhao
cs.AI
Abstract
Recent research on video large language models (VideoLLMs) has predominantly
focused on model architectures and training datasets, leaving the interaction
format between the user and the model under-explored. In existing works, users
typically interact with VideoLLMs by providing an entire video and a query as
input, after which the model generates a response. This interaction format
constrains the application of VideoLLMs in scenarios such as live-streaming
comprehension, where videos do not end and responses are required in real
time, and it also results in unsatisfactory performance on time-sensitive
tasks that require localizing video segments. In this paper, we focus on a
video-text duet interaction format. This format is characterized by continuous
playback of the video, during which both the user and the model can insert
text messages at any position; when a text message ends, the video resumes
playing, akin to the alternation of two performers in a duet. We construct
MMDuetIT, a video-text training dataset designed to adapt VideoLLMs to the
video-text duet interaction format. We also introduce the Multi-Answer
Grounded Video Question Answering (MAGQA) task to benchmark the real-time
response ability of VideoLLMs. Trained on MMDuetIT, MMDuet demonstrates that
adopting the video-text duet interaction format enables significant
improvements on various time-sensitive tasks (76% CIDEr on YouCook2 dense
video captioning, 90% mAP on QVHighlights highlight detection, and 25% R@0.5
on Charades-STA temporal video grounding) with minimal training effort, and
also enables VideoLLMs to reply in real time as the video plays. Code, data,
and a demo are available at:
https://github.com/yellow-binary-tree/MMDuet
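To make the duet interaction format concrete, below is a minimal Python sketch of the loop the abstract describes: the video plays frame by frame, the user may insert a query at any position, and the model decides at each step whether to insert a reply before playback continues. All names here (ToyDuetModel, duet_playback, the fixed speak_every period) are hypothetical illustrations, not the MMDuet API; in particular, MMDuet learns when to respond, whereas this toy stand-in uses a fixed schedule. See the linked repository for the actual implementation.

```python
# Hypothetical sketch of a video-text duet interaction loop.
# Not the MMDuet codebase; names and logic are illustrative only.

from typing import Iterable, List, Tuple


class ToyDuetModel:
    """Stand-in for a VideoLLM adapted to the duet format (illustrative only)."""

    def __init__(self, speak_every: int = 4) -> None:
        self.frames_seen = 0
        self.speak_every = speak_every  # toy stand-in for a learned "when to speak" decision

    def observe(self, frame: object) -> None:
        # A real model would encode the frame and update its context here.
        self.frames_seen += 1

    def should_speak(self) -> bool:
        # MMDuet learns when to respond; this toy version uses a fixed period.
        return self.frames_seen % self.speak_every == 0

    def generate_reply(self) -> str:
        # A real model would generate text grounded in the frames seen so far.
        return f"(reply grounded in the first {self.frames_seen} frames)"


def duet_playback(
    model: ToyDuetModel,
    frames: Iterable[object],
    user_messages: dict,
) -> List[Tuple[int, str, str]]:
    """Interleave continuous video playback with user and model text messages."""
    transcript: List[Tuple[int, str, str]] = []
    for t, frame in enumerate(frames):
        model.observe(frame)                  # the video keeps playing
        if t in user_messages:                # the user speaks mid-playback
            transcript.append((t, "user", user_messages[t]))
        if model.should_speak():              # the model decides when to reply
            transcript.append((t, "model", model.generate_reply()))
        # once a text message ends, playback simply resumes
    return transcript


if __name__ == "__main__":
    # Twelve placeholder "frames"; the user asks a question at frame 2.
    log = duet_playback(ToyDuetModel(), range(12), {2: "When is the sauce added?"})
    for step, speaker, text in log:
        print(f"frame {step:2d} [{speaker}]: {text}")
```

The key contrast with the conventional whole-video-then-query format is that here text messages are interleaved with frames as they arrive, so a response can be emitted before the video ends.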