Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models

Abstract

We present an efficient encoder-free approach for video-language understanding that achieves competitive performance while significantly reducing computational overhead. Current video-language models typically rely on heavyweight image encoders (300M-1.1B parameters) or video encoders (1B-1.4B parameters), creating a substantial computational burden when processing multi-frame videos. Our method introduces a novel Spatio-Temporal Alignment Block (STAB) that processes video inputs directly, without requiring pre-trained encoders, while using only 45M parameters for visual processing, at least a 6.5× reduction compared to traditional approaches. The STAB architecture combines Local Spatio-Temporal Encoding for fine-grained feature extraction, efficient spatial downsampling through learned attention, and separate mechanisms for modeling frame-level and video-level relationships. Our model achieves comparable or superior performance to encoder-based approaches for open-ended video question answering on standard benchmarks. The fine-grained video question-answering evaluation demonstrates our model's effectiveness, outperforming the encoder-based approaches Video-ChatGPT and Video-LLaVA in key aspects such as correctness and temporal understanding. Extensive ablation studies validate our architectural choices and demonstrate the effectiveness of our spatio-temporal modeling approach, which also achieves 3-4× faster processing speeds than previous methods. Code is available at https://github.com/jh-yi/Video-Panda.
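The parameter-reduction claim can be checked with simple arithmetic. The sketch below is only an illustration: it assumes the endpoints of the encoder size ranges quoted in the abstract are representative, and the names STAB_PARAMS_M, ENCODER_PARAMS_M, and reduction_factor are hypothetical and not taken from the Video-Panda code.

```python
# Parameter counts in millions, taken from the ranges quoted in the abstract.
# All names here are illustrative assumptions, not part of the released code.
STAB_PARAMS_M = 45

ENCODER_PARAMS_M = {
    "image encoder (low end)": 300,
    "image encoder (high end)": 1100,
    "video encoder (low end)": 1000,
    "video encoder (high end)": 1400,
}


def reduction_factor(encoder_params_m, stab_params_m=STAB_PARAMS_M):
    """Return how many times fewer visual parameters STAB uses than an encoder."""
    return encoder_params_m / stab_params_m


reductions = {
    name: reduction_factor(params) for name, params in ENCODER_PARAMS_M.items()
}

for name, factor in reductions.items():
    print(f"{name}: {factor:.1f}x fewer visual parameters")
```

Under these assumptions the smallest encoder gives the smallest ratio, 300M / 45M ≈ 6.7, which is consistent with the "at least 6.5×" figure; the larger encoders give ratios between roughly 22× and 31×.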
