LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer
December 18, 2024
作者: Yipeng Zhang, Yifan Liu, Zonghao Guo, Yidan Zhang, Xuesong Yang, Chi Chen, Jun Song, Bo Zheng, Yuan Yao, Zhiyuan Liu, Tat-Seng Chua, Maosong Sun
cs.AI
Abstract
In multimodal large language models (MLLMs), vision transformers (ViTs) are
widely employed for visual encoding. However, their performance in solving
universal MLLM tasks is not satisfactory. We attribute this to a lack of
information from diverse visual levels, which impedes alignment with the varied
semantic granularity required for language generation. To address this issue,
we present LLaVA-UHD v2, an advanced MLLM centered around a Hierarchical window
(Hiwin) transformer that captures diverse visual granularity by constructing
and integrating a high-resolution feature pyramid. As a vision-language
projector, the Hiwin transformer comprises two primary modules: (i) an inverse
feature pyramid, constructed by a ViT-derived feature up-sampling process
utilizing high-frequency details from an image pyramid, and (ii) hierarchical
window attention, focusing on a set of key sampling features within cross-scale
windows to condense multi-level feature maps. Extensive experiments demonstrate
that LLaVA-UHD v2 achieves superior performance over existing MLLMs on popular
benchmarks. Notably, our design brings an average boost of 3.7% across 14
benchmarks compared with the baseline method, 9.3% on DocVQA for instance. We
make all the data, model checkpoints, and code publicly available to facilitate
future research.
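The two modules described above can be illustrated with a minimal NumPy sketch. This is a toy approximation under stated assumptions, not the paper's implementation: the "inverse feature pyramid" is stood in for by nearest-neighbor upsampling (rather than the paper's image-guided, learned up-sampling), and the "hierarchical window attention" is sketched as learnable queries attending over mean-pooled, cross-scale aligned windows. All function and variable names (`build_pyramid`, `hiwin_condense`, `win`, `num_queries`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def build_pyramid(feat, levels=2):
    """Naive stand-in for the inverse feature pyramid: upsample the
    ViT-derived feature map by nearest-neighbor repetition (the paper
    instead uses high-frequency details from an image pyramid)."""
    pyramid = [feat]
    for _ in range(levels - 1):
        f = pyramid[-1]
        f = f.repeat(2, axis=0).repeat(2, axis=1)  # 2x spatial upsample
        pyramid.append(f)
    return pyramid

def hiwin_condense(pyramid, num_queries=4, win=2, seed=0):
    """Sketch of hierarchical window attention: a fixed set of query
    tokens attends to features pooled from spatially aligned windows
    at every pyramid level, condensing multi-level feature maps."""
    C = pyramid[0].shape[-1]
    rng = np.random.default_rng(seed)
    queries = rng.standard_normal((num_queries, C))  # "learnable" queries
    kv = []
    for lvl, f in enumerate(pyramid):
        H, W, _ = f.shape
        w = win * (2 ** lvl)  # window grows with resolution so windows align
        for i in range(0, H, w):
            for j in range(0, W, w):
                kv.append(f[i:i+w, j:j+w].mean(axis=(0, 1)))  # pool each window
    kv = np.stack(kv)                               # (total_windows, C)
    attn = softmax(queries @ kv.T / np.sqrt(C))     # (num_queries, total_windows)
    return attn @ kv                                # condensed visual tokens

feat = np.random.default_rng(1).standard_normal((4, 4, 8))  # toy ViT feature map
tokens = hiwin_condense(build_pyramid(feat))
print(tokens.shape)  # (4, 8): a few condensed tokens regardless of resolution
```

The point of the sketch is the interface: however large the feature pyramid grows, the projector emits a small, fixed number of condensed tokens for the language model, with each token able to draw on both coarse and fine visual levels.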