Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction

January 6, 2025
Authors: Rui Qian, Shuangrui Ding, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang
cs.AI

Abstract

Active real-time interaction with video LLMs introduces a new paradigm for human-computer interaction, where the model not only understands user intent but also responds while continuously processing streaming video on the fly. Unlike offline video LLMs, which analyze the entire video before answering questions, active real-time interaction requires three capabilities: 1) Perception: real-time video monitoring and interaction capturing. 2) Decision: raising proactive interaction in proper situations. 3) Reaction: continuous interaction with users. However, these capabilities inherently conflict. Decision and Reaction require opposite scales and granularities of Perception, and autoregressive decoding blocks real-time Perception and Decision while a Reaction is being generated. To unify the conflicting capabilities within a single harmonious system, we present Dispider, a system that disentangles Perception, Decision, and Reaction. Dispider features a lightweight proactive streaming video processing module that tracks the video stream and identifies optimal moments for interaction. Once an interaction is triggered, an asynchronous interaction module provides a detailed response while the processing module continues to monitor the video. Our disentangled and asynchronous design ensures timely, contextually accurate, and computationally efficient responses, making Dispider well suited to active real-time interaction over long-duration video streams. Experiments show that Dispider not only maintains strong performance on conventional video QA tasks, but also significantly surpasses previous online models in streaming scenario responses, thereby validating the effectiveness of our architecture. The code and model are released at https://github.com/Mark12Ding/Dispider.
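The disentangled, asynchronous design described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation and does not use the Dispider codebase. It only assumes that a cheap monitoring loop (Perception and Decision) launches a slow response generator (Reaction) as a background task instead of waiting for it. The names `monitor`, `react`, `should_interact`, `FRAME_INTERVAL`, and `DECODE_TIME`, as well as the toy frame stream, are hypothetical.

```python
import asyncio

FRAME_INTERVAL = 0.01   # hypothetical: seconds between incoming frames
DECODE_TIME = 0.05      # hypothetical: seconds to generate one response


async def react(idx, log):
    """Reaction: stand-in for slow autoregressive decoding."""
    log.append(f"start {idx}")
    await asyncio.sleep(DECODE_TIME)
    log.append(f"done {idx}")


async def monitor(frames, should_interact, log):
    """Perception and Decision loop that never waits on a Reaction."""
    pending = []
    for idx, frame in enumerate(frames):
        log.append(f"see {idx}")                 # Perception
        if should_interact(frame):               # Decision
            # Launch the Reaction as a background task and keep watching.
            pending.append(asyncio.create_task(react(idx, log)))
        await asyncio.sleep(FRAME_INTERVAL)      # wait for the next frame
    await asyncio.gather(*pending)               # drain outstanding responses


def run_demo():
    # Toy stream (assumption): 1 marks a frame that warrants a response.
    frames = [0, 0, 1, 0, 0, 1, 0, 0]
    log = []
    asyncio.run(monitor(frames, lambda f: f == 1, log))
    return log


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the log shows later frames being observed while an earlier response is still being generated, which is the behavior a blocking decode loop would prevent. A real system would replace the sleeps with model inference and would likely run the two modules on separate workers or devices.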
