PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance
November 4, 2024
Authors: Ruyang Liu, Haoran Tang, Haibo Liu, Yixiao Ge, Ying Shan, Chen Li, Jiankun Yang
cs.AI
Abstract
The past year has witnessed the significant advancement of video-based large
language models. However, the challenge of developing a unified model for both
short and long video understanding remains unresolved. Most existing video LLMs
cannot handle hour-long videos, while methods customized for long videos tend to be
ineffective for shorter videos and images. In this paper, we identify the key
issue as the redundant content in videos. To address this, we propose a novel
pooling strategy that simultaneously achieves token compression and
instruction-aware visual feature aggregation. Our model is termed Prompt-guided
Pooling LLaVA, or PPLLaVA for short. Specifically, PPLLaVA consists of three
core components: the CLIP-based visual-prompt alignment that extracts visual
information relevant to the user's instructions, the prompt-guided pooling that
compresses the visual sequence to arbitrary scales using convolution-style
pooling, and the CLIP context extension designed for the lengthy prompts common in
visual dialogue. Moreover, our codebase also integrates the most advanced video
Direct Preference Optimization (DPO) and visual interleave training. Extensive
experiments have validated the performance of our model. With superior
throughput and a visual context of only 1024 tokens, PPLLaVA achieves better
results on image benchmarks despite being a video LLM, while reaching
state-of-the-art performance across various video benchmarks, excelling in
tasks ranging from caption generation to multiple-choice question answering,
and handling video lengths from seconds to hours. Code is available at
https://github.com/farewellthree/PPLLaVA.
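
To make the mechanism more concrete, the following PyTorch-style sketch illustrates one way prompt-guided pooling could be realized: visual tokens are scored against the CLIP text embedding of the user instruction, and a relevance-weighted average is taken within contiguous windows to reach a target length. The function name, shapes, and windowing scheme are assumptions for illustration, not the authors' implementation.

```python
# A minimal sketch of prompt-guided pooling, assuming CLIP-style features.
# Names, shapes, and the windowing scheme are illustrative assumptions.
import torch
import torch.nn.functional as F


def prompt_guided_pool(visual_tokens: torch.Tensor,
                       prompt_embed: torch.Tensor,
                       out_len: int = 1024) -> torch.Tensor:
    """Compress a visual token sequence under guidance of the text prompt.

    visual_tokens: (N, D) patch/frame features from a CLIP vision encoder.
    prompt_embed:  (D,)   pooled CLIP text feature of the user instruction.
    out_len:       target number of visual tokens passed to the LLM.
    """
    # 1) Visual-prompt alignment: score each visual token by its
    #    similarity to the instruction embedding.
    sim = F.cosine_similarity(visual_tokens, prompt_embed.unsqueeze(0), dim=-1)  # (N,)
    weights = sim.softmax(dim=0)                                                 # (N,)

    # 2) Convolution-style pooling: split the sequence into contiguous
    #    windows and take a relevance-weighted average inside each window,
    #    so the output length can be set to an arbitrary scale.
    token_windows = torch.chunk(visual_tokens, out_len, dim=0)
    weight_windows = torch.chunk(weights, out_len, dim=0)
    pooled = torch.stack([
        (v * w.unsqueeze(-1)).sum(dim=0) / (w.sum() + 1e-6)
        for v, w in zip(token_windows, weight_windows)
    ])
    return pooled  # (<= out_len, D)
```

In this sketch, for example, tens of thousands of frame-patch tokens could be reduced to a 1024-token visual context, with instruction-relevant tokens dominating each window average and redundant content contributing little.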