StreamChat: Chatting with Streaming Video

Abstract

This paper presents StreamChat, a novel approach that enhances the interaction capabilities of Large Multimodal Models (LMMs) with streaming video content. In streaming interaction scenarios, existing methods rely solely on the visual information available at the moment a question is posed, so responses lag behind the video because the model remains unaware of subsequent changes in the stream. StreamChat addresses this limitation by updating the visual context at each decoding step, ensuring that the model uses up-to-date video content throughout the decoding process. Additionally, we introduce a flexible and efficient cross-attention-based architecture that processes dynamic streaming inputs while maintaining inference efficiency for streaming interactions. Furthermore, we construct a new dense instruction dataset to facilitate the training of streaming interaction models, complemented by a parallel 3D-RoPE mechanism that encodes the relative temporal information of visual and text tokens. Experimental results demonstrate that StreamChat achieves competitive performance on established image and video benchmarks and exhibits superior capabilities in streaming interaction scenarios compared to state-of-the-art video LMMs.
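The idea of refreshing the visual context at every decoding step can be illustrated with a toy loop. The following is a minimal sketch under stated assumptions, not the paper's implementation: the names `stream_decode`, `frames_at`, and `next_token` are hypothetical, frames are stand-in string labels rather than visual tokens, and the "model" is a plain callable. It contrasts a decoder that freezes the frames seen at question time with one that re-reads the stream before each token.

```python
from typing import Callable, List, Sequence


def stream_decode(
    frames_at: Callable[[float], Sequence[str]],
    next_token: Callable[[List[str], Sequence[str]], str],
    question_time: float,
    seconds_per_token: float,
    max_tokens: int,
    refresh: bool = True,
) -> List[str]:
    """Toy decoding loop (hypothetical helper, not the paper's code).

    frames_at(t) returns every frame that has arrived by time t.
    next_token(tokens, visual) stands in for one model decoding step.
    With refresh=True the visual context is re-read before each token;
    with refresh=False it stays frozen at the time the question was asked.
    """
    tokens: List[str] = []
    frozen = frames_at(question_time)
    for step in range(max_tokens):
        # Wall-clock time advances while the answer is being generated.
        now = question_time + step * seconds_per_token
        visual = frames_at(now) if refresh else frozen
        token = next_token(tokens, visual)
        if token == "<eos>":
            break
        tokens.append(token)
    return tokens


# Illustrative stream: (arrival time in seconds, frame label).
FRAMES = [(0.0, "cat"), (1.0, "dog"), (2.0, "bird")]


def frames_at(t: float) -> List[str]:
    return [label for ts, label in FRAMES if ts <= t]


def describe_latest(tokens: List[str], visual: Sequence[str]) -> str:
    # Stand-in "model": emits the label of the most recent visible frame.
    return visual[-1]


streaming = stream_decode(frames_at, describe_latest, 0.0, 1.0, 3)
frozen = stream_decode(frames_at, describe_latest, 0.0, 1.0, 3, refresh=False)
print(streaming)  # ['cat', 'dog', 'bird']
print(frozen)     # ['cat', 'cat', 'cat']
```

In this toy setting the frozen variant keeps describing the first frame, while the refreshed variant follows the stream as it changes during generation, which is the gap the abstract describes.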
