Multimodal Long Video Modeling Based on Temporal Dynamic Context

April 14, 2025
Authors: Haoran Hao, Jiaming Han, Yiyuan Zhang, Xiangyu Yue
cs.AI

Abstract

Recent advances in Large Language Models (LLMs) have led to significant breakthroughs in video understanding. However, existing models still struggle with long videos due to the context length constraints of LLMs and the vast amount of information a video contains. Although some recent methods are designed for long video understanding, they often lose crucial information during token compression and struggle with additional modalities such as audio. In this work, we propose a dynamic long-video encoding method that exploits the temporal relationships between frames, named Temporal Dynamic Context (TDC). First, we segment the video into semantically consistent scenes based on inter-frame similarity, then encode each frame into tokens using visual-audio encoders. Second, we propose a novel temporal context compressor that reduces the number of tokens within each segment. Specifically, we employ a query-based Transformer to aggregate video, audio, and instruction-text tokens into a limited set of temporal context tokens. Finally, we feed the static frame tokens and the temporal context tokens into the LLM for video understanding. Furthermore, to handle extremely long videos, we propose a training-free chain-of-thought strategy that progressively extracts answers from multiple video segments. These intermediate answers serve as part of the reasoning process and contribute to the final answer. We conduct extensive experiments on general video understanding and audio-video understanding benchmarks, where our method demonstrates strong performance. The code and models are available at https://github.com/Hoar012/TDC-Video.
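The abstract outlines a concrete pipeline: similarity-based scene segmentation, per-frame visual-audio encoding, and a query-based Transformer that compresses each segment into a fixed set of temporal context tokens. The following is a minimal PyTorch sketch of the two core steps as we read them; every name, shape, and hyperparameter here (`segment_scenes`, `TemporalContextCompressor`, `threshold`, `num_queries`) is an illustrative assumption, not the authors' implementation, which is available in the linked repository.

```python
# Illustrative sketch of the TDC pipeline described in the abstract.
# All names, shapes, and hyperparameters are assumptions for exposition;
# see https://github.com/Hoar012/TDC-Video for the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


def segment_scenes(frame_feats: torch.Tensor, threshold: float = 0.85):
    """Split a video into semantically consistent scenes.

    frame_feats: (T, D) pooled feature vector per frame.
    A new scene starts whenever the cosine similarity between
    consecutive frames drops below `threshold` (assumed heuristic).
    Returns a list of (start, end) frame-index pairs.
    """
    sims = F.cosine_similarity(frame_feats[:-1], frame_feats[1:], dim=-1)
    boundaries = [0] + [i + 1 for i, s in enumerate(sims) if s < threshold]
    boundaries.append(frame_feats.size(0))
    return [(boundaries[i], boundaries[i + 1]) for i in range(len(boundaries) - 1)]


class TemporalContextCompressor(nn.Module):
    """Query-based Transformer that aggregates the video, audio, and
    instruction-text tokens of one segment into a fixed number of
    temporal context tokens (a Q-Former-style design; details assumed)."""

    def __init__(self, dim: int = 1024, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, segment_tokens: torch.Tensor) -> torch.Tensor:
        # segment_tokens: (B, N, D) concatenated video + audio + text tokens.
        q = self.queries.unsqueeze(0).expand(segment_tokens.size(0), -1, -1)
        attended, _ = self.cross_attn(q, segment_tokens, segment_tokens)
        x = self.norm1(q + attended)
        return self.norm2(x + self.ffn(x))  # (B, num_queries, D)
```

Per the abstract, the compressed temporal context tokens from each segment would then be concatenated with the static frame tokens and passed to the LLM; the learnable queries are what keep the token budget fixed regardless of segment length.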
