Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models

Abstract

We present an efficient encoder-free approach for video-language understanding that achieves competitive performance while significantly reducing computational overhead. Current video-language models typically rely on heavyweight image encoders (300M-1.1B parameters) or video encoders (1B-1.4B parameters), creating a substantial computational burden when processing multi-frame videos. Our method introduces a novel Spatio-Temporal Alignment Block (STAB) that processes video inputs directly, without requiring pre-trained encoders, while using only 45M parameters for visual processing, at least a 6.5× reduction compared to traditional approaches. The STAB architecture combines Local Spatio-Temporal Encoding for fine-grained feature extraction, efficient spatial downsampling through learned attention, and separate mechanisms for modeling frame-level and video-level relationships. Our model achieves comparable or superior performance to encoder-based approaches for open-ended video question answering on standard benchmarks. The fine-grained video question-answering evaluation demonstrates our model's effectiveness: it outperforms the encoder-based approaches Video-ChatGPT and Video-LLaVA in key aspects such as correctness and temporal understanding. Extensive ablation studies validate our architectural choices and demonstrate the effectiveness of our spatio-temporal modeling approach, which also achieves 3-4× faster processing speeds than previous methods. Code is available at https://github.com/jh-yi/Video-Panda.
