OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts
March 29, 2025
Authors: Yuxuan Wang, Yueqian Wang, Bo Chen, Tong Wu, Dongyan Zhao, Zilong Zheng
cs.AI
Abstract
The rapid advancement of multi-modal language models (MLLMs) like GPT-4o has
propelled the development of Omni language models, designed to process and
proactively respond to continuous streams of multi-modal data. Despite their
potential, evaluating their real-world interactive capabilities in streaming
video contexts remains a formidable challenge. In this work, we introduce
OmniMMI, a comprehensive multi-modal interaction benchmark tailored for
OmniLLMs in streaming video contexts. OmniMMI encompasses over 1,121 videos and
2,290 questions, addressing two critical yet underexplored challenges in
existing video benchmarks: streaming video understanding and proactive
reasoning, across six distinct subtasks. Moreover, we propose a novel
framework, Multi-modal Multiplexing Modeling (M4), designed to enable an
inference-efficient streaming model that can see and listen while generating.
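To make the "see and listen while generating" idea concrete, here is a minimal, purely illustrative Python sketch of a multiplexed streaming loop that interleaves newly arriving video/audio chunks with autoregressive decoding steps. It is not the M4 architecture; all names (ToyMultiplexedStreamer, Chunk, decode_step) are hypothetical placeholders.

```python
# Illustrative sketch only: a toy event loop showing the general idea of
# multiplexing perception and generation in a streaming setting, i.e.
# folding incoming video/audio chunks into the context between decode steps.
# Names and logic are hypothetical and do not reflect the OmniMMI/M4 code.

from collections import deque
from dataclasses import dataclass


@dataclass
class Chunk:
    modality: str   # "video" or "audio"
    payload: bytes  # raw frame or audio segment


class ToyMultiplexedStreamer:
    """Interleaves perception (new chunks) with generation (next tokens)."""

    def __init__(self) -> None:
        self.inbox: deque[Chunk] = deque()
        self.context: list[str] = []

    def push(self, chunk: Chunk) -> None:
        # Perception side: new multi-modal input arrives while generating.
        self.inbox.append(chunk)

    def decode_step(self) -> str:
        # Placeholder for one autoregressive decoding step.
        return f"tok{len(self.context)}"

    def run(self, max_tokens: int = 5) -> list[str]:
        tokens: list[str] = []
        for _ in range(max_tokens):
            # Fold any newly arrived chunks into the context before producing
            # the next token, so perception and generation proceed together.
            while self.inbox:
                chunk = self.inbox.popleft()
                self.context.append(f"<{chunk.modality}>")
            tok = self.decode_step()
            self.context.append(tok)
            tokens.append(tok)
        return tokens


if __name__ == "__main__":
    streamer = ToyMultiplexedStreamer()
    streamer.push(Chunk("video", b"frame-0"))
    streamer.push(Chunk("audio", b"pcm-0"))
    print(streamer.run())
```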