VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format

Abstract

Recent research on video large language models (VideoLLMs) predominantly focuses on model architectures and training datasets, leaving the interaction format between the user and the model under-explored. In existing works, users typically interact with VideoLLMs by providing the entire video and a query as input, after which the model generates a response. This interaction format constrains the application of VideoLLMs in scenarios such as live-streaming comprehension, where videos do not end and responses are required in real time, and it also results in unsatisfactory performance on time-sensitive tasks that require localizing video segments. In this paper, we focus on a video-text duet interaction format. This format is characterized by continuous playback of the video, during which both the user and the model can insert text messages at any position. When a text message ends, the video continues to play, akin to the alternation of two performers in a duet. We construct MMDuetIT, a video-text training dataset designed to adapt VideoLLMs to the video-text duet interaction format. We also introduce the Multi-Answer Grounded Video Question Answering (MAGQA) task to benchmark the real-time response ability of VideoLLMs. Trained on MMDuetIT, MMDuet demonstrates that adopting the video-text duet interaction format enables the model to achieve significant improvements on various time-sensitive tasks (76% CIDEr on YouCook2 dense video captioning, 90% mAP on QVHighlights highlight detection, and 25% R@0.5 on Charades-STA temporal video grounding) with minimal training effort, and also enables VideoLLMs to reply in real time as the video plays. Code, data, and demo are available at: https://github.com/yellow-binary-tree/MMDuet.
