VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format
November 27, 2024
作者: Yueqian Wang, Xiaojun Meng, Yuxuan Wang, Jianxin Liang, Jiansheng Wei, Huishuai Zhang, Dongyan Zhao
cs.AI
Abstract
Recent research on video large language models (VideoLLMs) has predominantly
focused on model architectures and training datasets, leaving the interaction
format between the user and the model under-explored. In existing works, users
often interact with VideoLLMs by using the entire video and a query as input,
after which the model generates a response. This interaction format constrains
the application of VideoLLMs in scenarios such as live-streaming comprehension
where videos do not end and responses are required in real time, and
also results in unsatisfactory performance on time-sensitive tasks that
require localizing video segments. In this paper, we focus on a video-text
duet interaction format. This interaction format is characterized by the
continuous playback of the video, and both the user and the model can insert
their text messages at any position during the video playback. When a text
message ends, the video continues to play, akin to the alternation of two
performers in a duet. We construct MMDuetIT, a video-text training dataset
designed to adapt VideoLLMs to the video-text duet interaction format. We also
introduce the Multi-Answer Grounded Video Question Answering (MAGQA) task to
benchmark the real-time response ability of VideoLLMs. Trained on MMDuetIT,
MMDuet demonstrates that adopting the video-text duet interaction format
enables the model to achieve significant improvements in various time-sensitive
tasks (76% CIDEr on YouCook2 dense video captioning, 90% mAP on QVHighlights
highlight detection and 25% [email protected] on Charades-STA temporal video grounding)
with minimal training effort, and also enables VideoLLMs to reply in real
time as the video plays. Code, data and demo are available at:
https://github.com/yellow-binary-tree/MMDuet
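
To make the duet interaction format concrete, here is a minimal Python sketch of the playback loop it describes: the video advances frame by frame, the user may insert a text message at any timestamp, and the model decides after each frame whether to insert a reply before playback resumes. The names used here (DuetModel, should_reply, run_duet) are illustrative assumptions for this sketch, not the actual MMDuet implementation; see the repository linked above for the real code.

```python
# A minimal, runnable sketch of the video-text duet interaction loop described
# above. All names here (DuetModel, should_reply, run_duet) are illustrative
# assumptions for this sketch, not the actual MMDuet implementation or API.

from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    """One text message inserted into the video timeline."""
    time: float   # timestamp in seconds at which the message is inserted
    speaker: str  # "user" or "model"
    text: str


class DuetModel:
    """Placeholder model that decides after every frame whether to speak."""

    def should_reply(self, frame_index: int, history: List[Turn]) -> bool:
        # A real VideoLLM would score the current frame given the dialogue
        # history; this stand-in simply replies every 30 frames.
        return frame_index > 0 and frame_index % 30 == 0

    def generate(self, frame_index: int, history: List[Turn]) -> str:
        return f"(model comment about the video around frame {frame_index})"


def run_duet(num_frames: int, fps: float, user_turns: List[Turn]) -> List[Turn]:
    """Play the video frame by frame, interleaving user and model messages."""
    model = DuetModel()
    history: List[Turn] = []
    for i in range(num_frames):
        t = i / fps
        # 1) The user may insert a message at any point during playback; each
        #    user turn is attached to the first frame at or after its timestamp.
        for turn in user_turns:
            if t <= turn.time < t + 1.0 / fps:
                history.append(turn)
        # 2) After seeing each frame, the model may insert its own message;
        #    playback then resumes, like two performers alternating in a duet.
        if model.should_reply(i, history):
            history.append(Turn(time=t, speaker="model",
                                text=model.generate(i, history)))
    return history


if __name__ == "__main__":
    timeline = run_duet(
        num_frames=120,
        fps=2.0,
        user_turns=[Turn(time=5.0, speaker="user",
                         text="Tell me when the chef adds the sauce.")],
    )
    for turn in timeline:
        print(f"[{turn.time:5.1f}s] {turn.speaker}: {turn.text}")
```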