LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

October 22, 2024
Authors: Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, Vikas Chandra
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge, constrained by the LLM's context size. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism that reduces the number of video tokens while preserving the visual details of long videos. Our idea is to leverage cross-modal queries and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we use DINOv2 features to remove redundant frames that exhibit high similarity. We then use a text-guided cross-modal query for selective frame feature reduction. Further, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy effectively processes a large number of frames with little visual information loss within a given context length. LongVU consistently surpasses existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. Given a lightweight LLM, LongVU also scales effectively to a smaller size with state-of-the-art video understanding performance.
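As a rough illustration of the first step described in the abstract (removing redundant frames that exhibit high feature similarity), the sketch below keeps a frame only if its feature vector is sufficiently different from the last kept frame. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the cosine-similarity measure, the comparison against the last kept frame, and the `threshold` value are all hypothetical, and the plain lists of floats merely stand in for per-frame DINOv2 features.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_redundant_frames(
    features: Sequence[Sequence[float]],
    threshold: float = 0.9,  # hypothetical value, not taken from the paper
) -> List[int]:
    """Return indices of frames to keep.

    A frame is kept when its similarity to the most recently kept frame
    is below `threshold`; otherwise it is treated as redundant.
    """
    kept: List[int] = []
    for i, feat in enumerate(features):
        if not kept or cosine_similarity(features[kept[-1]], feat) < threshold:
            kept.append(i)
    return kept


# Toy stand-ins for per-frame features (assumed, for illustration only).
frame_features = [
    [1.0, 0.0],
    [0.99, 0.05],  # nearly identical to frame 0
    [0.0, 1.0],
    [0.02, 1.0],   # nearly identical to frame 2
    [1.0, 0.0],
]
kept_indices = drop_redundant_frames(frame_features)
print(kept_indices)
```

In this toy input, frames 1 and 3 are near-duplicates of their predecessors and are dropped. The later steps named in the abstract (text-guided selective feature reduction and spatial token reduction across frames) are not covered by this sketch.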
