Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding
January 14, 2025
Authors: Zhaokai Wang, Xizhou Zhu, Xue Yang, Gen Luo, Hao Li, Changyao Tian, Wenhan Dou, Junqi Ge, Lewei Lu, Yu Qiao, Jifeng Dai
cs.AI
Abstract
Image pyramids are widely adopted in top-performing methods to obtain
multi-scale features for precise visual perception and understanding. However,
current image pyramids use the same large-scale model to process multiple
resolutions of images, leading to significant computational cost. To address
this challenge, we propose a novel network architecture, called
Parameter-Inverted Image Pyramid Networks (PIIP). Specifically, PIIP uses
pretrained models (ViTs or CNNs) as branches to process multi-scale images,
where images of higher resolutions are processed by smaller network branches to
balance computational cost and performance. To integrate information from
different spatial scales, we further propose a novel cross-branch feature
interaction mechanism. To validate PIIP, we apply it to various perception
models and a representative multimodal large language model called LLaVA, and
conduct extensive experiments on various tasks such as object detection,
segmentation, image classification and multimodal understanding. PIIP achieves
superior performance compared to single-branch and existing multi-resolution
approaches with lower computational cost. When applied to InternViT-6B, a
large-scale vision foundation model, PIIP can improve its performance by 1%-2%
on detection and segmentation with only 40%-60% of the original computation,
finally achieving 60.0 box AP on MS COCO and 59.7 mIoU on ADE20K. For
multimodal understanding, our PIIP-LLaVA achieves 73.0% accuracy on TextVQA and
74.5% on MMBench with only 2.8M training data. Our code is released at
https://github.com/OpenGVLab/PIIP.
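
The sketch below illustrates the parameter-inverted pairing described in the abstract: higher-resolution views of an image are routed to smaller branches, and branch outputs are fused. This is a minimal, hypothetical illustration, not the released PIIP code; the class names, resolutions, and dimensions are assumptions, the placeholder encoders stand in for the pretrained ViT/CNN branches, and the pooled sum stands in for the paper's cross-branch feature interaction mechanism.

```python
# Minimal sketch of the parameter-inverted image pyramid idea (illustrative only,
# not the official PIIP implementation from https://github.com/OpenGVLab/PIIP).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyEncoder(nn.Module):
    """Placeholder patch encoder standing in for a pretrained ViT/CNN branch."""
    def __init__(self, embed_dim, patch_size=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.block = nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, C)
        return self.block(tokens)


class ParameterInvertedPyramid(nn.Module):
    """Higher-resolution inputs go to smaller branches; branch features are fused."""
    def __init__(self, resolutions=(448, 336, 224), dims=(96, 192, 384), out_dim=256):
        super().__init__()
        # resolutions sorted high -> low, dims low -> high:
        # the largest image is paired with the smallest branch (parameter inversion).
        self.resolutions = resolutions
        self.branches = nn.ModuleList(TinyEncoder(d) for d in dims)
        self.projs = nn.ModuleList(nn.Linear(d, out_dim) for d in dims)

    def forward(self, image):
        fused = None
        for res, branch, proj in zip(self.resolutions, self.branches, self.projs):
            x = F.interpolate(image, size=(res, res), mode="bilinear", align_corners=False)
            feat = proj(branch(x))      # (B, N_res, out_dim)
            pooled = feat.mean(dim=1)   # crude stand-in for cross-branch interaction
            fused = pooled if fused is None else fused + pooled
        return fused


if __name__ == "__main__":
    model = ParameterInvertedPyramid()
    out = model(torch.randn(1, 3, 448, 448))
    print(out.shape)  # torch.Size([1, 256])
```

The cost saving comes from this pairing: the expensive high-resolution pass runs through the branch with the fewest parameters, while the full-size branch only ever sees the low-resolution view.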