PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance
November 4, 2024
Authors: Ruyang Liu, Haoran Tang, Haibo Liu, Yixiao Ge, Ying Shan, Chen Li, Jiankun Yang
cs.AI
Abstract
The past year has witnessed the significant advancement of video-based large
language models. However, the challenge of developing a unified model for both
short and long video understanding remains unresolved. Most existing video LLMs
cannot handle hour-long videos, while methods tailored to long videos tend to be
ineffective for shorter videos and images. In this paper, we identify the key
issue as the redundant content in videos. To address this, we propose a novel
pooling strategy that simultaneously achieves token compression and
instruction-aware visual feature aggregation. Our model is termed Prompt-guided
Pooling LLaVA, or PPLLaVA for short. Specifically, PPLLaVA consists of three
core components: the CLIP-based visual-prompt alignment that extracts visual
information relevant to the user's instructions, the prompt-guided pooling that
compresses the visual sequence to arbitrary scales using convolution-style
pooling, and the CLIP context extension designed for the lengthy prompts common in
visual dialogue. Moreover, our codebase also integrates the most advanced video
Direct Preference Optimization (DPO) and visual interleave training. Extensive
experiments have validated the performance of our model. With superior
throughput and a visual context of only 1024 tokens, PPLLaVA achieves better results on
image benchmarks as a video LLM, while achieving state-of-the-art performance
across various video benchmarks, excelling in tasks ranging from caption
generation to multiple-choice questions, and handling video lengths from
seconds to hours. Code is available at
https://github.com/farewellthree/PPLLaVA.
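The pooling strategy the abstract describes can be illustrated with a minimal sketch: weight each visual token by its CLIP-style similarity to the user's prompt, then compress the weighted sequence with convolution-style (windowed average) pooling to an arbitrary target length. This is a toy reading of the abstract, not the authors' implementation; the function name, temperature value, and even-division assumption are all hypothetical.

```python
import numpy as np

def prompt_guided_pool(visual_tokens, prompt_embed, target_len, temp=0.07):
    """Toy prompt-guided pooling: relevance weighting + windowed compression.

    visual_tokens: (n, d) flattened video token embeddings
    prompt_embed:  (d,) pooled text-prompt embedding (e.g. from a CLIP text encoder)
    target_len:    desired number of output tokens (e.g. 1024)
    """
    n, d = visual_tokens.shape
    # CLIP-style relevance: cosine similarity between each token and the prompt.
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    p = prompt_embed / np.linalg.norm(prompt_embed)
    sim = v @ p                                   # (n,)
    w = np.exp((sim - sim.max()) / temp)          # numerically stable softmax
    w = w / w.sum()
    weighted = visual_tokens * w[:, None] * n     # rescale to preserve magnitude
    # Convolution-style pooling: average non-overlapping windows of the sequence.
    assert n % target_len == 0, "sketch assumes the length divides evenly"
    pooled = weighted.reshape(target_len, n // target_len, d).mean(axis=1)
    return pooled

tokens = np.random.randn(4096, 64)   # e.g. 16 frames x 256 patch tokens (toy sizes)
prompt = np.random.randn(64)         # stand-in for a pooled text embedding
out = prompt_guided_pool(tokens, prompt, 1024)
print(out.shape)                     # (1024, 64)
```

Because the pooling window is computed from the requested output length rather than fixed, the same mechanism compresses a few seconds of video or an hour of video into the same 1024-token budget, which is the property the abstract attributes to compressing "to arbitrary scales".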