
StreamChat: Chatting with Streaming Video

December 11, 2024
Authors: Jihao Liu, Zhiding Yu, Shiyi Lan, Shihao Wang, Rongyao Fang, Jan Kautz, Hongsheng Li, Jose M. Alvarez
cs.AI

Abstract

This paper presents StreamChat, a novel approach that enhances the interaction capabilities of Large Multimodal Models (LMMs) with streaming video content. In streaming interaction scenarios, existing methods rely solely on visual information available at the moment a question is posed, resulting in significant delays as the model remains unaware of subsequent changes in the streaming video. StreamChat addresses this limitation by innovatively updating the visual context at each decoding step, ensuring that the model utilizes up-to-date video content throughout the decoding process. Additionally, we introduce a flexible and efficient cross-attention-based architecture to process dynamic streaming inputs while maintaining inference efficiency for streaming interactions. Furthermore, we construct a new dense instruction dataset to facilitate the training of streaming interaction models, complemented by a parallel 3D-RoPE mechanism that encodes the relative temporal information of visual and text tokens. Experimental results demonstrate that StreamChat achieves competitive performance on established image and video benchmarks and exhibits superior capabilities in streaming interaction scenarios compared to state-of-the-art video LMMs.
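The core idea of refreshing the visual context at every decoding step, rather than freezing it at question time, can be illustrated with a minimal sketch. All names here (`StreamingDecoder`, `push_frame`, `frame_source`) are illustrative assumptions, not the paper's actual API; the real model would encode frames into visual tokens and consume them via cross-attention rather than reading raw frames.

```python
# Hypothetical sketch: a decode loop that re-reads the latest frames at every
# step, so tokens generated mid-answer can reflect video changes that occurred
# after the question was posed. Names and structure are illustrative only.

from collections import deque


class StreamingDecoder:
    """Toy decoder keeping a sliding window of recent frames as its
    visual context, refreshed once per decoding step."""

    def __init__(self, window=8):
        # Bounded window of the most recent frames (stand-in for the
        # encoded visual tokens a real LMM would cross-attend to).
        self.frames = deque(maxlen=window)

    def push_frame(self, frame):
        self.frames.append(frame)

    def visual_context(self):
        # Snapshot of the current window; re-built at each step so the
        # decoder never works from stale visual information.
        return list(self.frames)

    def decode(self, steps, frame_source):
        tokens = []
        for t in range(steps):
            # New frames may arrive while the answer is still being generated.
            for frame in frame_source(t):
                self.push_frame(frame)
            ctx = self.visual_context()  # refreshed visual context
            # Placeholder "token" recording how many frames were visible.
            tokens.append(f"tok{t}@{len(ctx)}frames")
        return tokens


if __name__ == "__main__":
    decoder = StreamingDecoder(window=8)
    # Simulate one new frame arriving per decoding step.
    print(decoder.decode(3, lambda t: [t]))
```

The contrast with prior streaming methods is that a conventional decoder would call `visual_context()` once before the loop, so every token would see only the frames available when the question arrived.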

