Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding

Abstract

Image pyramids are widely adopted in top-performing methods to obtain multi-scale features for precise visual perception and understanding. However, current image pyramids use the same large-scale model to process images at every resolution, which incurs significant computational cost. To address this challenge, we propose a novel network architecture called Parameter-Inverted Image Pyramid Networks (PIIP). Specifically, PIIP uses pretrained models (ViTs or CNNs) as branches to process multi-scale images, where higher-resolution images are processed by smaller network branches to balance computational cost and performance. To integrate information from different spatial scales, we further propose a novel cross-branch feature interaction mechanism. To validate PIIP, we apply it to various perception models and to LLaVA, a representative multimodal large language model, and conduct extensive experiments on tasks such as object detection, segmentation, image classification, and multimodal understanding. PIIP achieves superior performance compared to single-branch and existing multi-resolution approaches at lower computational cost. When applied to InternViT-6B, a large-scale vision foundation model, PIIP improves its performance by 1%-2% on detection and segmentation with only 40%-60% of the original computation, ultimately achieving 60.0 box AP on MS COCO and 59.7 mIoU on ADE20K. For multimodal understanding, our PIIP-LLaVA achieves 73.0% accuracy on TextVQA and 74.5% on MMBench with only 2.8M training data. Our code is released at https://github.com/OpenGVLab/PIIP.
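
The cost argument behind the parameter-inverted pairing can be illustrated with a rough back-of-the-envelope sketch. This is not the authors' implementation. It assumes a simplified cost model in which a ViT forward pass costs about 2 x parameters x tokens (ignoring the quadratic attention term and the cost of cross-branch interactions), uses approximate parameter counts for ViT-T/S/B, and picks a hypothetical three-level pyramid with a patch size of 16. All names and numbers below are illustrative assumptions.

```python
# Rough cost model (assumption): forward FLOPs of a ViT ~ 2 * params * tokens.
# Attention's quadratic term and cross-branch interaction costs are ignored.

PATCH = 16  # assumed patch size

# Approximate parameter counts in millions (illustrative values).
PARAMS_M = {"ViT-T": 6, "ViT-S": 22, "ViT-B": 86}

# Hypothetical image pyramid, from highest to lowest resolution.
RESOLUTIONS = [896, 448, 224]


def num_tokens(resolution, patch=PATCH):
    """Number of patch tokens for a square image."""
    side = resolution // patch
    return side * side


def branch_cost(model, resolution):
    """Estimated GFLOPs for one branch under the simplified cost model."""
    return 2 * PARAMS_M[model] * 1e6 * num_tokens(resolution) / 1e9


def pyramid_cost(pairs):
    """Total estimated GFLOPs over (model, resolution) pairs."""
    return sum(branch_cost(model, res) for model, res in pairs)


# Conventional image pyramid: the same large model at every resolution.
standard = [("ViT-B", res) for res in RESOLUTIONS]

# Parameter-inverted pyramid: the smallest model gets the largest image.
inverted = list(zip(["ViT-T", "ViT-S", "ViT-B"], RESOLUTIONS))

standard_cost = pyramid_cost(standard)
inverted_cost = pyramid_cost(inverted)

if __name__ == "__main__":
    print(f"standard pyramid: {standard_cost:.1f} GFLOPs (estimated)")
    print(f"inverted pyramid: {inverted_cost:.1f} GFLOPs (estimated)")
    print(f"ratio: {inverted_cost / standard_cost:.2f}")
```

Under these assumptions the token count grows quadratically with resolution, so assigning the smallest model to the largest image keeps the per-branch costs roughly comparable instead of letting the high-resolution branch dominate. The figures are estimates from the simplified model only and should not be read as the paper's measured computation.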
