Multimodal Long Video Modeling Based on Temporal Dynamic Context

April 14, 2025
Authors: Haoran Hao, Jiaming Han, Yiyuan Zhang, Xiangyu Yue
cs.AI

Abstract

Recent advances in Large Language Models (LLMs) have led to significant breakthroughs in video understanding. However, existing models still struggle with long video processing due to the context length constraint of LLMs and the vast amount of information within the video. Although some recent methods are designed for long video understanding, they often lose crucial information during token compression and struggle with additional modalities such as audio. In this work, we propose a dynamic long video encoding method utilizing the temporal relationship between frames, named Temporal Dynamic Context (TDC). Firstly, we segment the video into semantically consistent scenes based on inter-frame similarities, then encode each frame into tokens using visual-audio encoders. Secondly, we propose a novel temporal context compressor to reduce the number of tokens within each segment. Specifically, we employ a query-based Transformer to aggregate video, audio, and instruction text tokens into a limited set of temporal context tokens. Finally, we feed the static frame tokens and the temporal context tokens into the LLM for video understanding. Furthermore, to handle extremely long videos, we propose a training-free chain-of-thought strategy that progressively extracts answers from multiple video segments. These intermediate answers serve as part of the reasoning process and contribute to the final answer. We conduct extensive experiments on general video understanding and audio-video understanding benchmarks, where our method demonstrates strong performance. The code and models are available at https://github.com/Hoar012/TDC-Video.
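
To make the pipeline concrete, the sketch below illustrates the two core steps described in the abstract: similarity-based scene segmentation and query-based compression of per-scene tokens into a small set of temporal context tokens. This is a minimal PyTorch sketch under assumed shapes; the names (`segment_by_similarity`, `TemporalContextCompressor`), dimensions, threshold, and query count are illustrative assumptions and not the authors' released implementation (see the linked repository for the actual code).

```python
import torch
import torch.nn as nn


def segment_by_similarity(frame_feats, threshold=0.85):
    """Group consecutive frames into scenes: start a new scene whenever the
    cosine similarity to the previous frame drops below the threshold.
    frame_feats: (num_frames, dim). Threshold is an illustrative choice."""
    scenes, current = [], [0]
    for i in range(1, frame_feats.shape[0]):
        sim = torch.cosine_similarity(frame_feats[i], frame_feats[i - 1], dim=0)
        if sim < threshold:
            scenes.append(current)
            current = [i]
        else:
            current.append(i)
    scenes.append(current)
    return scenes


class TemporalContextCompressor(nn.Module):
    """Query-based Transformer: a fixed set of learnable queries cross-attends
    to the concatenated video/audio/instruction tokens of one scene and returns
    a small, fixed number of temporal context tokens."""

    def __init__(self, dim=1024, num_queries=16, num_heads=8, num_layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, scene_tokens):
        # scene_tokens: (batch, num_tokens, dim) — video + audio + instruction tokens
        q = self.queries.unsqueeze(0).expand(scene_tokens.size(0), -1, -1)
        return self.decoder(q, scene_tokens)  # (batch, num_queries, dim)


# Toy usage with random features, just to show shapes: 64 frames, 1024-dim each.
frames = torch.randn(64, 1024)
scenes = segment_by_similarity(frames)
compressor = TemporalContextCompressor(dim=1024, num_queries=16)
context_tokens = [compressor(frames[idx].unsqueeze(0)) for idx in scenes]
```

In this reading of the abstract, the static frame tokens and the per-scene temporal context tokens would then be concatenated and fed to the LLM; for extremely long videos, the same procedure would be run segment by segment, with intermediate answers carried forward as chain-of-thought context.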
