LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer

Abstract

In multimodal large language models (MLLMs), vision transformers (ViTs) are widely employed for visual encoding. However, their performance on universal MLLM tasks is not satisfactory. We attribute this to a lack of information from diverse visual levels, which impedes alignment with the varied semantic granularity required for language generation. To address this issue, we present LLaVA-UHD v2, an advanced MLLM centered around a hierarchical window (Hiwin) transformer that captures diverse visual granularity by constructing and integrating a high-resolution feature pyramid. As a vision-language projector, the Hiwin transformer comprises two primary modules: (i) an inverse feature pyramid, constructed by a ViT-derived feature up-sampling process that uses high-frequency details from an image pyramid, and (ii) hierarchical window attention, which focuses on a set of key sampling features within cross-scale windows to condense multi-level feature maps. Extensive experiments demonstrate that LLaVA-UHD v2 outperforms existing MLLMs on popular benchmarks. Notably, our design brings an average boost of 3.7% across 14 benchmarks compared with the baseline method, for instance 9.3% on DocVQA. We make all the data, model checkpoints, and code publicly available to facilitate future research.
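The idea of condensing multi-level feature maps within cross-scale windows can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes the pyramid levels are already available as arrays, omits all learned projections and the feature up-sampling step, and uses the mean of the coarsest level as a stand-in query. The names `window_slice`, `condense_pyramid`, and `grid` are hypothetical and chosen only for this illustration.

```python
import numpy as np


def window_slice(size, n_windows, idx):
    """Return the [start, stop) range of window `idx` along an axis of length `size`."""
    start = (idx * size) // n_windows
    stop = ((idx + 1) * size) // n_windows
    return start, stop


def condense_pyramid(levels, grid):
    """Condense multi-level feature maps into grid * grid tokens.

    levels: list of arrays shaped (H_l, W_l, C), coarsest level first
            (assumed input layout, each side at least `grid` long).
    grid:   number of windows per side (hypothetical parameter).
    """
    channels = levels[0].shape[-1]
    tokens = []
    for row in range(grid):
        for col in range(grid):
            # Gather the features of every level that fall inside this window.
            per_level = []
            for feat in levels:
                r0, r1 = window_slice(feat.shape[0], grid, row)
                c0, c1 = window_slice(feat.shape[1], grid, col)
                per_level.append(feat[r0:r1, c0:c1].reshape(-1, channels))
            # Assumption: the coarsest-level window mean acts as the query.
            query = per_level[0].mean(axis=0)
            keys = np.concatenate(per_level, axis=0)
            # Scaled dot-product attention of one query over the window's keys.
            scores = keys @ query / np.sqrt(channels)
            weights = np.exp(scores - scores.max())
            weights /= weights.sum()
            tokens.append(weights @ keys)
    return np.stack(tokens)


rng = np.random.default_rng(0)
pyramid = [rng.normal(size=(s, s, 8)) for s in (4, 8, 16)]
print(condense_pyramid(pyramid, grid=2).shape)
```

Each output token mixes coarse and fine features from the same spatial region, so the number of tokens passed to the language model depends on `grid` rather than on the resolution of the finest level.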

