

Visual Context Window Extension: A New Perspective for Long Video Understanding

September 30, 2024
Authors: Hongchen Wei, Zhenzhong Chen
cs.AI

Abstract

Large Multimodal Models (LMMs) have demonstrated impressive performance in short video understanding tasks but face great challenges when applied to long video understanding. In contrast, Large Language Models (LLMs) exhibit outstanding capabilities in modeling long texts. Existing work attempts to address this issue by introducing long video-text pairs during training. However, these approaches require substantial computational and data resources. In this paper, we tackle the challenge of long video understanding from the perspective of context windows, aiming to apply LMMs to long video tasks without retraining on long video datasets. We first conduct an in-depth analysis of why pretrained LMMs struggle to understand lengthy video content, identifying that discrepancies between the visual and language modalities lead to different context windows for visual and language tokens, making it difficult to directly extend the visual tokens to match the language context window. Based on this, we propose to adapt LMMs for long video understanding tasks by extending the visual context window, eliminating the need for retraining on large-scale long video datasets. To further mitigate the significant memory consumption caused by long sequences, we introduce a progressive pooling inference strategy that selectively adjusts the spatial resolution of frame embeddings, reducing the number of visual tokens while retaining important spatial information. Across multiple long video understanding benchmarks, our method consistently improves performance as the number of video frames increases. On the MLVU benchmark, our method outperforms GPT-4o, even though our model size is only 7B. Additionally, in the 256-frame setting, our method reduces memory usage by approximately 45% compared to the baseline, without introducing any performance loss.
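To make the progressive pooling idea concrete, below is a minimal Python sketch of how frame embeddings could be downsampled at different spatial resolutions before being flattened into visual tokens. The function name, tensor shapes, and the specific pooling schedule are illustrative assumptions, not the authors' implementation; the abstract only specifies that spatial resolution is adjusted selectively per frame to cut the visual token count.

```python
# Hypothetical sketch of a progressive pooling step: each frame contributes
# an [H, W, D] grid of patch embeddings, and later frames are pooled to
# coarser grids so the total visual token count fits the extended window.
# Shapes, names, and the pooling schedule are assumptions for illustration.
import torch
import torch.nn.functional as F


def progressive_pool(frame_embeds, target_sizes):
    """frame_embeds: list of [H, W, D] tensors, one per sampled frame.
    target_sizes: list of (h, w) output grid sizes, one per frame."""
    pooled_tokens = []
    for embed, (h, w) in zip(frame_embeds, target_sizes):
        # Channels-first for pooling: [D, H, W] -> [D, h, w]
        grid = F.adaptive_avg_pool2d(embed.permute(2, 0, 1), (h, w))
        # Flatten back into a token sequence: [h * w, D]
        pooled_tokens.append(grid.permute(1, 2, 0).reshape(h * w, -1))
    return torch.cat(pooled_tokens, dim=0)


# Example: 8 frames of 24x24 patch embeddings (D = 1024); earlier frames
# keep full resolution, later frames are pooled progressively smaller.
frames = [torch.randn(24, 24, 1024) for _ in range(8)]
sizes = [(24, 24), (24, 24), (16, 16), (16, 16),
         (12, 12), (12, 12), (8, 8), (8, 8)]
visual_tokens = progressive_pool(frames, sizes)
print(visual_tokens.shape)  # torch.Size([2080, 1024]) vs. 4608 without pooling
```

Under this hypothetical schedule, the token count drops from 8 × 576 = 4608 to 2080 while the earliest (or otherwise prioritized) frames retain their full spatial detail, which matches the abstract's stated goal of reducing memory while preserving important spatial information.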
