OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts
March 29, 2025
Authors: Yuxuan Wang, Yueqian Wang, Bo Chen, Tong Wu, Dongyan Zhao, Zilong Zheng
cs.AI
Abstract
The rapid advancement of multi-modal language models (MLLMs) like GPT-4o has
propelled the development of Omni language models, designed to process and
proactively respond to continuous streams of multi-modal data. Despite their
potential, evaluating their real-world interactive capabilities in streaming
video contexts remains a formidable challenge. In this work, we introduce
OmniMMI, a comprehensive multi-modal interaction benchmark tailored for
OmniLLMs in streaming video contexts. OmniMMI encompasses over 1,121 videos and
2,290 questions, addressing two critical yet underexplored challenges in
existing video benchmarks: streaming video understanding and proactive
reasoning, across six distinct subtasks. Moreover, we propose a novel
framework, Multi-modal Multiplexing Modeling (M4), designed to enable an
inference-efficient streaming model that can see and listen while generating.
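To make the "see and listen while generating" idea concrete, here is a minimal, purely illustrative Python sketch of a multiplexed streaming loop that interleaves newly arriving video/audio chunks with autoregressive decoding steps. It is not the M4 architecture; all names (ToyMultiplexedStreamer, Chunk, decode_step) are hypothetical placeholders.

```python
# Illustrative sketch only: a toy event loop showing the general idea of
# multiplexing perception and generation in a streaming setting, i.e.
# folding incoming video/audio chunks into the context between decode steps.
# Names and logic are hypothetical and do not reflect the OmniMMI/M4 code.

from collections import deque
from dataclasses import dataclass


@dataclass
class Chunk:
    modality: str   # "video" or "audio"
    payload: bytes  # raw frame or audio segment


class ToyMultiplexedStreamer:
    """Interleaves perception (new chunks) with generation (next tokens)."""

    def __init__(self) -> None:
        self.inbox: deque[Chunk] = deque()
        self.context: list[str] = []

    def push(self, chunk: Chunk) -> None:
        # Perception side: new multi-modal input arrives while generating.
        self.inbox.append(chunk)

    def decode_step(self) -> str:
        # Placeholder for one autoregressive decoding step.
        return f"tok{len(self.context)}"

    def run(self, max_tokens: int = 5) -> list[str]:
        tokens: list[str] = []
        for _ in range(max_tokens):
            # Fold any newly arrived chunks into the context before producing
            # the next token, so perception and generation proceed together.
            while self.inbox:
                chunk = self.inbox.popleft()
                self.context.append(f"<{chunk.modality}>")
            tok = self.decode_step()
            self.context.append(tok)
            tokens.append(tok)
        return tokens


if __name__ == "__main__":
    streamer = ToyMultiplexedStreamer()
    streamer.push(Chunk("video", b"frame-0"))
    streamer.push(Chunk("audio", b"pcm-0"))
    print(streamer.run())
```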