Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models
December 24, 2024
Authors: Jinhui Yi, Syed Talal Wasim, Yanan Luo, Muzammal Naseer, Juergen Gall
cs.AI
Abstract
We present an efficient encoder-free approach for video-language
understanding that achieves competitive performance while significantly
reducing computational overhead. Current video-language models typically rely
on heavyweight image encoders (300M-1.1B parameters) or video encoders (1B-1.4B
parameters), creating a substantial computational burden when processing
multi-frame videos. Our method introduces a novel Spatio-Temporal Alignment
Block (STAB) that directly processes video inputs without requiring pre-trained
encoders while using only 45M parameters for visual processing - at least a
6.5× reduction compared to traditional approaches. The STAB architecture
combines Local Spatio-Temporal Encoding for fine-grained feature extraction,
efficient spatial downsampling through learned attention, and separate
mechanisms for modeling frame-level and video-level relationships. Our model
achieves comparable or superior performance to encoder-based approaches for
open-ended video question answering on standard benchmarks. The fine-grained
video question-answering evaluation demonstrates our model's effectiveness,
outperforming the encoder-based approaches Video-ChatGPT and Video-LLaVA in key
aspects like correctness and temporal understanding. Extensive ablation studies
validate our architectural choices and demonstrate the effectiveness of our
spatio-temporal modeling approach while achieving 3-4× faster processing
speeds than previous methods. Code is available at
https://github.com/jh-yi/Video-Panda.
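To illustrate the "spatial downsampling through learned attention" idea mentioned in the abstract, here is a minimal NumPy sketch: each non-overlapping spatial window is pooled as a convex combination of its patch features, weighted by softmax scores from a learned scoring vector. The window size, scoring vector `w_score`, and tensor shapes are assumptions for illustration, not the paper's actual STAB implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_downsample(feats, w_score, window=2):
    """Attention-weighted spatial pooling (hypothetical sketch).

    feats:   (T, H, W, C) per-frame patch features
    w_score: (C,) learned scoring vector (random here, learned in practice)
    returns: (T, H // window, W // window, C)
    """
    T, H, W, C = feats.shape
    Ho, Wo = H // window, W // window
    # Group features into non-overlapping window x window blocks.
    x = feats.reshape(T, Ho, window, Wo, window, C).transpose(0, 1, 3, 2, 4, 5)
    x = x.reshape(T, Ho, Wo, window * window, C)
    # Score each position in the window, normalize, and pool.
    attn = softmax(x @ w_score, axis=-1)       # (T, Ho, Wo, window*window)
    return (attn[..., None] * x).sum(axis=3)   # (T, Ho, Wo, C)

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16, 16, 64)).astype(np.float32)
w_score = rng.normal(size=(64,)).astype(np.float32)
out = attention_downsample(feats, w_score)
print(out.shape)  # (8, 8, 8, 64)
```

Unlike fixed average pooling, the learned scores let the model emphasize informative patches within each window; in the actual model the scoring parameters would be trained end-to-end with the language model.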