Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction

Abstract

Active real-time interaction with video LLMs introduces a new paradigm for human-computer interaction, in which the model not only understands user intent but also responds while continuously processing streaming video on the fly. Unlike offline video LLMs, which analyze the entire video before answering questions, active real-time interaction requires three capabilities: 1) Perception: real-time video monitoring and interaction capturing; 2) Decision: raising proactive interaction in appropriate situations; 3) Reaction: continuous interaction with users. However, these capabilities inherently conflict. Decision and Reaction require opposing Perception scales and granularities, and autoregressive decoding blocks real-time Perception and Decision during Reaction. To unify these conflicting capabilities within a harmonious system, we present Dispider, a system that disentangles Perception, Decision, and Reaction. Dispider features a lightweight proactive streaming video processing module that tracks the video stream and identifies optimal moments for interaction. Once an interaction is triggered, an asynchronous interaction module provides detailed responses while the processing module continues to monitor the video. Our disentangled and asynchronous design ensures timely, contextually accurate, and computationally efficient responses, making Dispider well suited to active real-time interaction over long-duration video streams. Experiments show that Dispider not only maintains strong performance on conventional video QA tasks, but also significantly surpasses previous online models in streaming scenario responses, thereby validating the effectiveness of our architecture. The code and model are released at https://github.com/Mark12Ding/Dispider.
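
The idea of decoupling monitoring from response generation can be illustrated with a toy sketch. This is not Dispider's implementation: it only shows, under simplifying assumptions, how a monitoring loop can keep perceiving frames while a slow response runs as a separate asynchronous task. The names `should_interact`, `respond`, `monitor`, and `run_demo`, the trigger rule, and the sleep durations are all hypothetical placeholders.

```python
import asyncio


def should_interact(frame_id):
    # Hypothetical decision rule: trigger on every 4th frame.
    # A real system would use a learned decision module instead.
    return frame_id > 0 and frame_id % 4 == 0


async def respond(frame_id, log):
    # Stand-in for slow autoregressive decoding (placeholder delay).
    await asyncio.sleep(0.05)
    log.append(("response", frame_id))


async def monitor(num_frames, log):
    pending = []
    for frame_id in range(num_frames):
        # Perception: record that the frame was seen.
        log.append(("perceived", frame_id))
        # Decision: launch a response task without waiting for it.
        if should_interact(frame_id):
            pending.append(asyncio.create_task(respond(frame_id, log)))
        # Wait for the next frame (placeholder frame interval).
        await asyncio.sleep(0.01)
    # Let any still-running responses finish before exiting.
    await asyncio.gather(*pending)


def run_demo(num_frames=10):
    log = []
    asyncio.run(monitor(num_frames, log))
    return log


if __name__ == "__main__":
    for event in run_demo():
        print(event)
```

In this sketch, frames that arrive after a trigger are logged as perceived before the corresponding response completes, which is the non-blocking behavior the abstract attributes to the asynchronous design.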
