Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding

Abstract

Image pyramids are widely adopted in top-performing methods to obtain multi-scale features for precise visual perception and understanding. However, current image pyramids use the same large-scale model to process images at multiple resolutions, which incurs significant computational cost. To address this challenge, we propose a novel network architecture called Parameter-Inverted Image Pyramid Networks (PIIP). Specifically, PIIP uses pretrained models (ViTs or CNNs) as branches to process multi-scale images, where higher-resolution images are processed by smaller network branches to balance computational cost and performance. To integrate information from different spatial scales, we further propose a novel cross-branch feature interaction mechanism. To validate PIIP, we apply it to various perception models and to LLaVA, a representative multimodal large language model, and conduct extensive experiments on object detection, segmentation, image classification, and multimodal understanding. PIIP achieves superior performance compared to single-branch and existing multi-resolution approaches at lower computational cost. When applied to InternViT-6B, a large-scale vision foundation model, PIIP improves its performance by 1%-2% on detection and segmentation with only 40%-60% of the original computation, ultimately achieving 60.0 box AP on MS COCO and 59.7 mIoU on ADE20K. For multimodal understanding, our PIIP-LLaVA achieves 73.0% accuracy on TextVQA and 74.5% on MMBench using only 2.8M training data. Our code is released at https://github.com/OpenGVLab/PIIP.
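The parameter-inverted pairing described in the abstract can be illustrated with a rough cost comparison. The sketch below is not the PIIP implementation: the `Branch` type, the helper names, the model sizes, the resolutions, and the patch size of 16 are all hypothetical, and the cost proxy (parameters times patch tokens) is a crude assumption that ignores attention's quadratic term and the cross-branch interaction modules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Branch:
    params_m: float  # model size in millions of parameters (illustrative value)
    resolution: int  # side length of the square input in pixels (illustrative value)


def approx_cost(branch: Branch, patch: int = 16) -> float:
    """Crude compute proxy: parameter count times number of patch tokens."""
    tokens = (branch.resolution // patch) ** 2
    return branch.params_m * tokens


def pair_inverted(model_sizes, resolutions):
    """Pair the largest model with the lowest resolution, and so on downward."""
    sizes = sorted(model_sizes, reverse=True)
    res = sorted(resolutions)
    return [Branch(s, r) for s, r in zip(sizes, res)]


def pair_same_model(model_size, resolutions):
    """Conventional image pyramid: one model processes every resolution."""
    return [Branch(model_size, r) for r in sorted(resolutions)]


# Hypothetical branch sizes and input resolutions, chosen only for illustration.
sizes = [300.0, 90.0, 20.0]
resolutions = [256, 512, 1024]

inverted = pair_inverted(sizes, resolutions)
pyramid = pair_same_model(max(sizes), resolutions)

inverted_cost = sum(approx_cost(b) for b in inverted)
pyramid_cost = sum(approx_cost(b) for b in pyramid)

print(f"inverted / same-model cost ratio: {inverted_cost / pyramid_cost:.2f}")
```

Under this proxy the inverted pairing is cheaper because the token count grows quadratically with resolution, so assigning the highest resolution to the smallest branch removes the dominant cost term. The actual savings reported in the abstract come from the authors' experiments, not from this estimate.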

