PAVE: Patching and Adapting Video Large Language Models

March 25, 2025
Authors: Zhuoming Liu, Yiquan Li, Khoi Duc Nguyen, Yiwu Zhong, Yin Li
cs.AI

Abstract

Pre-trained video large language models (Video LLMs) exhibit remarkable reasoning capabilities, yet adapting these models to new tasks involving additional modalities or data types (e.g., audio or 3D information) remains challenging. In this paper, we present PAVE, a flexible framework for adapting pre-trained Video LLMs to downstream tasks with side-channel signals, such as audio, 3D cues, or multi-view videos. PAVE introduces lightweight adapters, referred to as "patches," which add a small number of parameters and operations to a base model without modifying its architecture or pre-trained weights. In doing so, PAVE can effectively adapt the pre-trained base model to support diverse downstream tasks, including audio-visual question answering, 3D reasoning, multi-view video recognition, and high frame rate video understanding. Across these tasks, PAVE significantly enhances the performance of the base model, surpassing state-of-the-art task-specific models while incurring a minor cost of ~0.1% additional FLOPs and parameters. Further, PAVE supports multi-task learning and generalizes well across different Video LLMs. Our code is available at https://github.com/dragonlzm/PAVE.
