PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance
November 4, 2024
Authors: Ruyang Liu, Haoran Tang, Haibo Liu, Yixiao Ge, Ying Shan, Chen Li, Jiankun Yang
cs.AI
Abstract
The past year has witnessed the significant advancement of video-based large
language models. However, the challenge of developing a unified model for both
short and long video understanding remains unresolved. Most existing video LLMs
cannot handle hour-long videos, while methods tailored to long videos tend to be
ineffective for shorter videos and images. In this paper, we identify the key
issue as the redundant content in videos. To address this, we propose a novel
pooling strategy that simultaneously achieves token compression and
instruction-aware visual feature aggregation. Our model is termed Prompt-guided
Pooling LLaVA, or PPLLaVA for short. Specifically, PPLLaVA consists of three
core components: the CLIP-based visual-prompt alignment that extracts visual
information relevant to the user's instructions, the prompt-guided pooling that
compresses the visual sequence to arbitrary scales using convolution-style
pooling, and the CLIP context extension designed for the lengthy prompts common in
visual dialogue. Moreover, our codebase also integrates the most advanced video
Direct Preference Optimization (DPO) and visual interleave training. Extensive
experiments have validated the performance of our model. With superior
throughput and a visual context of only 1024 tokens, PPLLaVA achieves better results on
image benchmarks as a video LLM, while achieving state-of-the-art performance
across various video benchmarks, excelling in tasks ranging from caption
generation to multiple-choice questions, and handling video lengths from
seconds to hours. Code is available at
https://github.com/farewellthree/PPLLaVA.
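The pooling strategy the abstract describes can be illustrated with a minimal sketch: weight each visual token by its CLIP-style similarity to the user's prompt, then compress the weighted sequence with convolution-style (windowed average) pooling to an arbitrary target length. This is a toy reading of the abstract, not the authors' implementation; the function name, temperature value, and even-division assumption are all hypothetical.

```python
import numpy as np

def prompt_guided_pool(visual_tokens, prompt_embed, target_len, temp=0.07):
    """Toy prompt-guided pooling: relevance weighting + windowed compression.

    visual_tokens: (n, d) flattened video token embeddings
    prompt_embed:  (d,) pooled text-prompt embedding (e.g. from a CLIP text encoder)
    target_len:    desired number of output tokens (e.g. 1024)
    """
    n, d = visual_tokens.shape
    # CLIP-style relevance: cosine similarity between each token and the prompt.
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    p = prompt_embed / np.linalg.norm(prompt_embed)
    sim = v @ p                                   # (n,)
    w = np.exp((sim - sim.max()) / temp)          # numerically stable softmax
    w = w / w.sum()
    weighted = visual_tokens * w[:, None] * n     # rescale to preserve magnitude
    # Convolution-style pooling: average non-overlapping windows of the sequence.
    assert n % target_len == 0, "sketch assumes the length divides evenly"
    pooled = weighted.reshape(target_len, n // target_len, d).mean(axis=1)
    return pooled

tokens = np.random.randn(4096, 64)   # e.g. 16 frames x 256 patch tokens (toy sizes)
prompt = np.random.randn(64)         # stand-in for a pooled text embedding
out = prompt_guided_pool(tokens, prompt, 1024)
print(out.shape)                     # (1024, 64)
```

Because the pooling window is computed from the requested output length rather than fixed, the same mechanism compresses a few seconds of video or an hour of video into the same 1024-token budget, which is the property the abstract attributes to compressing "to arbitrary scales".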