Unifying Specialized Visual Encoders for Video Language Models

January 2, 2025
Authors: Jihoon Chung, Tyler Zhu, Max Gonzalez Saez-Diez, Juan Carlos Niebles, Honglu Zhou, Olga Russakovsky
cs.AI

Abstract

The recent advent of Large Language Models (LLMs) has brought sophisticated reasoning capabilities to video through Video Large Language Models (VideoLLMs). However, VideoLLMs currently rely on a single vision encoder for all of their visual processing, which limits the amount and type of visual information that can be conveyed to the LLM. Our method, MERV (Multi-Encoder Representation of Videos), instead leverages multiple frozen visual encoders to create a unified representation of a video, providing the VideoLLM with a comprehensive set of specialized visual knowledge. Spatio-temporally aligning the features from each encoder allows us to tackle a wider range of open-ended and multiple-choice video understanding questions and to outperform prior state-of-the-art works. MERV is up to 3.7% better in accuracy than Video-LLaVA across the standard suite of video understanding benchmarks, while also achieving a better Video-ChatGPT score. We also improve upon SeViLA, the previous best on zero-shot Perception Test accuracy, by 2.2%. MERV introduces minimal extra parameters and trains faster than equivalent single-encoder methods while parallelizing the visual processing. Finally, we provide qualitative evidence that MERV successfully captures domain knowledge from each of its encoders. Our results offer promising directions for using multiple vision encoders for comprehensive video understanding.
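To make the idea of spatio-temporally aligning features from several encoders concrete, here is a minimal NumPy sketch. The abstract does not say how MERV performs the alignment or the fusion, so everything below is an assumption made for illustration: the helper names (`pool_axis`, `align`, `fuse`), the use of average pooling to reach a common grid, the channel-wise concatenation, and the feature shapes are all hypothetical and should not be read as the authors' method.

```python
import numpy as np


def pool_axis(x, axis, out_size):
    """Average-pool `x` along `axis` into `out_size` roughly equal bins."""
    if out_size > x.shape[axis]:
        raise ValueError("this sketch only downsamples; out_size is too large")
    chunks = np.array_split(x, out_size, axis=axis)
    pooled = [c.mean(axis=axis, keepdims=True) for c in chunks]
    return np.concatenate(pooled, axis=axis)


def align(features, t_out, h_out, w_out):
    """Resize a (T, H, W, C) feature grid to (t_out, h_out, w_out, C)."""
    for axis, size in enumerate((t_out, h_out, w_out)):
        features = pool_axis(features, axis, size)
    return features


def fuse(feature_list, t_out, h_out, w_out):
    """Align every encoder's features to one grid, then concatenate channels.

    Concatenation is an assumed stand-in for whatever fusion MERV uses.
    """
    aligned = [align(f, t_out, h_out, w_out) for f in feature_list]
    return np.concatenate(aligned, axis=-1)


# Hypothetical outputs of two frozen encoders with different grid sizes.
rng = np.random.default_rng(0)
enc_a = rng.standard_normal((16, 14, 14, 32))  # (T, H, W, C)
enc_b = rng.standard_normal((8, 16, 16, 48))

fused = fuse([enc_a, enc_b], t_out=8, h_out=7, w_out=7)
print(fused.shape)
```

Because each encoder is frozen and processed independently before fusion, the per-encoder forward passes in a setup like this could run in parallel, which is consistent with the parallelization the abstract mentions.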