StreamChat: Chatting with Streaming Video
December 11, 2024
Authors: Jihao Liu, Zhiding Yu, Shiyi Lan, Shihao Wang, Rongyao Fang, Jan Kautz, Hongsheng Li, Jose M. Alvarez
cs.AI
Abstract
This paper presents StreamChat, a novel approach that enhances the
interaction capabilities of Large Multimodal Models (LMMs) with streaming video
content. In streaming interaction scenarios, existing methods rely solely on
visual information available at the moment a question is posed, resulting in
significant delays as the model remains unaware of subsequent changes in the
streaming video. StreamChat addresses this limitation by innovatively updating
the visual context at each decoding step, ensuring that the model utilizes
up-to-date video content throughout the decoding process. Additionally, we
introduce a flexible and efficient cross-attention-based architecture to process
dynamic streaming inputs while maintaining inference efficiency for streaming
interactions. Furthermore, we construct a new dense instruction dataset to
facilitate the training of streaming interaction models, complemented by a
parallel 3D-RoPE mechanism that encodes the relative temporal information of
visual and text tokens. Experimental results demonstrate that StreamChat
achieves competitive performance on established image and video benchmarks and
exhibits superior capabilities in streaming interaction scenarios compared to
state-of-the-art video LMMs.
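The core mechanism described above, refreshing the visual context at every decoding step so newly arrived frames can influence the next token, can be illustrated with a toy sketch. This is not the authors' implementation: the function names, feature shapes, and the single-head attention here are illustrative assumptions, intended only to show a decode loop that pulls in new streaming frames before each cross-attention call.

```python
import math

def toy_cross_attention(query, keys, values):
    """Single-head scaled dot-product attention of one text-token query
    over a list of visual feature vectors (toy stand-in for the paper's
    cross-attention layers)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Weighted sum of value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]

def decode_with_streaming_context(frame_stream, num_steps, d=4):
    """Naive decode loop: before emitting each token, fold any frames
    that arrived since the last step into the visual key/value context,
    so later tokens attend to up-to-date video content.
    `frame_stream(step)` is a hypothetical callback returning the list
    of new frame feature vectors available at that step."""
    visual_keys, visual_values = [], []
    outputs = []
    for step in range(num_steps):
        for feat in frame_stream(step):
            visual_keys.append(feat)
            visual_values.append(feat)
        query = [1.0] * d  # placeholder text-token query
        outputs.append(toy_cross_attention(query, visual_keys, visual_values))
    return outputs
```

For example, with a stream that delivers one new frame per step (`lambda step: [[float(step)] * 4]`), the attention output at step 0 sees only the first frame, while the output at step 2 mixes all three frames, which is the contrast with methods that freeze the visual context at question time.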