LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer
December 18, 2024
Authors: Yipeng Zhang, Yifan Liu, Zonghao Guo, Yidan Zhang, Xuesong Yang, Chi Chen, Jun Song, Bo Zheng, Yuan Yao, Zhiyuan Liu, Tat-Seng Chua, Maosong Sun
cs.AI
Abstract
In multimodal large language models (MLLMs), vision transformers (ViTs) are
widely employed for visual encoding. However, their performance in solving
universal MLLM tasks is not satisfactory. We attribute this to a lack of
information from diverse visual levels, which impedes alignment with the various
semantic granularities required for language generation. To address this issue,
we present LLaVA-UHD v2, an advanced MLLM centered around a Hierarchical window
(Hiwin) transformer that captures diverse visual granularity by constructing
and integrating a high-resolution feature pyramid. As a vision-language
projector, the Hiwin transformer comprises two primary modules: (i) an inverse
feature pyramid, constructed by a ViT-derived feature up-sampling process
utilizing high-frequency details from an image pyramid, and (ii) hierarchical
window attention, which focuses on a set of key sampling features within cross-scale
windows to condense multi-level feature maps. Extensive experiments demonstrate
that LLaVA-UHD v2 achieves superior performance over existing MLLMs on popular
benchmarks. Notably, our design brings an average boost of 3.7% across 14
benchmarks compared with the baseline method, for instance a 9.3% gain on DocVQA. We
make all the data, model checkpoints, and code publicly available to facilitate
future research.
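To make the two-module pipeline described in the abstract concrete, below is a minimal, illustrative NumPy sketch of the general idea: build an extra high-resolution level from base ViT features (an "inverse pyramid" step, here naive nearest-neighbor up-sampling), then condense each level with window-local attention into a fixed grid of visual tokens. This is an assumption-laden toy, not the paper's actual Hiwin implementation; all function names (`upsample2x`, `hiwin_condense`) and the mean-query attention scheme are hypothetical simplifications.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def upsample2x(feat):
    # Nearest-neighbor up-sampling stand-in for the ViT-derived
    # feature up-sampling process: (H, W, C) -> (2H, 2W, C).
    return feat.repeat(2, axis=0).repeat(2, axis=1)

def hiwin_condense(feat, window=2):
    # Toy window attention: each (window x window) patch collapses to
    # one token, using the window mean as the query over its features.
    H, W, C = feat.shape
    out = np.empty((H // window, W // window, C))
    for i in range(0, H, window):
        for j in range(0, W, window):
            win = feat[i:i+window, j:j+window].reshape(-1, C)  # (w*w, C)
            q = win.mean(axis=0)                               # query token
            attn = softmax(win @ q / np.sqrt(C))               # (w*w,)
            out[i // window, j // window] = attn @ win         # weighted sum
    return out

# Toy two-level pyramid: base ViT features plus an up-sampled level.
rng = np.random.default_rng(0)
base = rng.standard_normal((4, 4, 8))       # low-res ViT feature map
hires = upsample2x(base)                    # higher-resolution level (8, 8, 8)

# Condense each level to the same 2x2 grid, then fuse across scales.
tokens_lo = hiwin_condense(base, window=2)  # (2, 2, 8)
tokens_hi = hiwin_condense(hires, window=4) # (2, 2, 8)
tokens = np.concatenate([tokens_lo, tokens_hi], axis=-1).reshape(4, -1)
print(tokens.shape)  # (4, 16): 4 window tokens carrying both levels
```

The point of the sketch is the token-budget property: regardless of how many pyramid levels are added, each spatial window yields a single condensed token, so the sequence length fed to the language model stays fixed while the feature content grows.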