Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding
January 14, 2025
作者: Zhaokai Wang, Xizhou Zhu, Xue Yang, Gen Luo, Hao Li, Changyao Tian, Wenhan Dou, Junqi Ge, Lewei Lu, Yu Qiao, Jifeng Dai
cs.AI
Abstract
Image pyramids are widely adopted in top-performing methods to obtain
multi-scale features for precise visual perception and understanding. However,
current image pyramids use the same large-scale model to process multiple
resolutions of images, leading to significant computational cost. To address
this challenge, we propose a novel network architecture, called
Parameter-Inverted Image Pyramid Networks (PIIP). Specifically, PIIP uses
pretrained models (ViTs or CNNs) as branches to process multi-scale images,
where images of higher resolutions are processed by smaller network branches to
balance computational cost and performance. To integrate information from
different spatial scales, we further propose a novel cross-branch feature
interaction mechanism. To validate PIIP, we apply it to various perception
models and a representative multimodal large language model called LLaVA, and
conduct extensive experiments on various tasks such as object detection,
segmentation, image classification and multimodal understanding. PIIP achieves
superior performance compared to single-branch and existing multi-resolution
approaches with lower computational cost. When applied to InternViT-6B, a
large-scale vision foundation model, PIIP can improve its performance by 1%-2%
on detection and segmentation with only 40%-60% of the original computation,
finally achieving 60.0 box AP on MS COCO and 59.7 mIoU on ADE20K. For
multimodal understanding, our PIIP-LLaVA achieves 73.0% accuracy on TextVQA and
74.5% on MMBench with only 2.8M training data. Our code is released at
https://github.com/OpenGVLab/PIIP.
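To make the parameter-inverted pairing concrete, the following is a minimal, self-contained sketch of the idea described in the abstract: larger (lower-resolution) inputs are routed to a larger branch and higher-resolution inputs to progressively smaller branches, after which multi-scale features are fused. This is not the official PIIP implementation (see https://github.com/OpenGVLab/PIIP for that); the `ToyBranch` modules, branch sizes, pyramid resolutions, and the single cross-attention fusion layer standing in for PIIP's cross-branch feature interaction mechanism are all illustrative assumptions.

```python
# Illustrative sketch of a parameter-inverted image pyramid, assuming toy
# transformer branches in place of pretrained ViTs/CNNs and a simple
# cross-attention layer in place of PIIP's cross-branch interactions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyBranch(nn.Module):
    """Stand-in for a pretrained backbone: patch embedding + transformer blocks."""

    def __init__(self, patch_size: int, dim: int, depth: int):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, dim_feedforward=dim * 4,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = self.embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        return self.blocks(tokens)


class ParameterInvertedPyramid(nn.Module):
    """Higher-resolution inputs go to *smaller* branches; features are then fused."""

    def __init__(self, out_dim: int = 256):
        super().__init__()
        # Parameter-inverted pairing (assumed sizes):
        #   low res  -> "large" branch, mid res -> "medium" branch, high res -> "small" branch.
        self.branches = nn.ModuleList([
            ToyBranch(patch_size=16, dim=384, depth=6),  # low-resolution branch
            ToyBranch(patch_size=16, dim=192, depth=4),  # mid-resolution branch
            ToyBranch(patch_size=16, dim=96, depth=2),   # high-resolution branch
        ])
        self.resolutions = (224, 448, 896)  # assumed pyramid scales
        # Project every branch to a shared width, then fuse with cross-attention.
        self.proj = nn.ModuleList([nn.Linear(d, out_dim) for d in (384, 192, 96)])
        self.fuse = nn.MultiheadAttention(out_dim, num_heads=4, batch_first=True)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feats = []
        for branch, proj, res in zip(self.branches, self.proj, self.resolutions):
            x = F.interpolate(image, size=(res, res), mode="bilinear",
                              align_corners=False)
            feats.append(proj(branch(x)))
        # Simplified stand-in for cross-branch interaction: low-resolution tokens
        # attend to the concatenated multi-scale tokens from all branches.
        query, context = feats[0], torch.cat(feats, dim=1)
        fused, _ = self.fuse(query, context, context)
        return fused  # (B, N_low, out_dim) multi-scale feature tokens


if __name__ == "__main__":
    model = ParameterInvertedPyramid()
    out = model(torch.randn(1, 3, 224, 224))
    print(out.shape)  # torch.Size([1, 196, 256])
```

The design point the sketch tries to convey is the compute balance: the most expensive input (the high-resolution image, with the most tokens) is handled by the cheapest branch, so the pyramid's total cost grows far more slowly than running one large model at every scale.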