When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning
March 10, 2025
Authors: Junwei Luo, Yingying Zhang, Xue Yang, Kang Wu, Qi Zhu, Lei Liang, Jingdong Chen, Yansheng Li
cs.AI
Abstract
Efficient vision-language understanding of large Remote Sensing Images (RSIs)
is meaningful but challenging. Current Large Vision-Language Models (LVLMs)
typically employ limited pre-defined grids to process images, leading to
information loss when handling gigapixel RSIs. Conversely, using unlimited
grids significantly increases computational costs. To preserve image details
while reducing computational complexity, we propose a text-guided token pruning
method with Dynamic Image Pyramid (DIP) integration. Our method introduces: (i)
a Region Focus Module (RFM) that leverages text-aware region localization
capability to identify critical vision tokens, and (ii) a coarse-to-fine image
tile selection and vision token pruning strategy based on DIP, which is guided
by RFM outputs and avoids directly processing the entire large imagery.
Additionally, existing benchmarks for evaluating LVLMs' perception ability on
large RSIs suffer from limited question diversity and constrained image sizes.
We construct a new benchmark named LRS-VQA, which contains 7,333 QA pairs
across 8 categories, with image lengths of up to 27,328 pixels. Our method
outperforms existing high-resolution strategies on four datasets using the same
data. Moreover, compared to existing token reduction methods, our approach
demonstrates higher efficiency under high-resolution settings. The dataset and
code are available at https://github.com/VisionXLab/LRS-VQA.
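The coarse-to-fine idea behind the Dynamic Image Pyramid can be illustrated with a minimal Python sketch: build progressively finer pyramid levels, then keep only the highest-scoring tiles at each step. The function names, the simple top-k scoring stand-in for the Region Focus Module, and the 448-pixel base resolution are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of coarse-to-fine tile selection over a dynamic
# image pyramid (DIP). All names and the 448-px base size are assumptions.
from typing import List, Tuple


def build_pyramid(h: int, w: int, min_side: int = 448) -> List[Tuple[int, int]]:
    """Halve the resolution until the short side would drop below min_side.

    Returns pyramid levels ordered coarsest first, so processing can start
    cheap and refine only where needed.
    """
    levels = [(h, w)]
    while min(h, w) // 2 >= min_side:
        h, w = h // 2, w // 2
        levels.append((h, w))
    return levels[::-1]  # coarsest first


def select_tiles(scores: List[float], keep_ratio: float) -> List[int]:
    """Keep the indices of the highest-scoring tiles.

    `scores` stands in for text-aware relevance scores (as an RFM-like
    module might produce); only the kept tiles would be re-examined at
    the next, finer pyramid level.
    """
    k = max(1, int(len(scores) * keep_ratio))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])
```

For example, a 3584x3584 image yields four levels (448 up to 3584 per side), and with `keep_ratio=0.5` only half of the tiles survive each refinement step, so the full-resolution image is never processed in its entirety.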