

Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding

April 14, 2025
作者: Tao Zhang, Xiangtai Li, Zilong Huang, Yanwei Li, Weixian Lei, Xueqing Deng, Shihao Chen, Shunping Ji, Jiashi Feng
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) achieve remarkable performance on fine-grained pixel-level understanding tasks. However, these works all rely heavily on extra components, such as a vision encoder (CLIP) and segmentation experts, leading to high system complexity and limiting model scaling. In this work, our goal is to explore a highly simplified MLLM without introducing extra components. Our work is motivated by recent work on the Single trAnsformer as a unified vIsion-Language Model (SAIL) design, which jointly learns vision tokens and text tokens within one transformer. We present Pixel-SAIL, a single transformer for pixel-wise MLLM tasks. In particular, we make three technical improvements on the plain baseline. First, we design a learnable upsampling module to refine visual token features. Second, we propose a novel visual prompt injection strategy that enables the single transformer to understand visual prompt inputs and benefit from the early fusion of visual prompt embeddings and vision tokens. Third, we introduce a vision expert distillation strategy to efficiently enhance the single transformer's fine-grained feature extraction capability. In addition, we have collected a comprehensive pixel understanding benchmark (PerBench) via manual checking. It includes three tasks: detailed object description, visual prompt-based question answering, and visual-text referring segmentation. Extensive experiments on four referring segmentation benchmarks, one visual prompt benchmark, and our PerBench show that Pixel-SAIL achieves comparable or even better results with a much simpler pipeline. Code and models will be released at https://github.com/magic-research/Sa2VA.
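
To make the first two improvements concrete, below is a minimal PyTorch-style sketch of (a) a learnable upsampling module that refines coarse visual tokens into a higher-resolution grid and (b) early fusion of a visual-prompt embedding with the vision tokens. This is an illustrative assumption of how such components could look, not the authors' released implementation; all module names, shapes, and hyperparameters here are hypothetical.

```python
# Hypothetical sketch of two ideas described in the abstract; not the official Pixel-SAIL code.
import torch
import torch.nn as nn


class LearnableUpsampler(nn.Module):
    """Refine coarse visual tokens into a higher-resolution token grid."""

    def __init__(self, dim: int, scale: int = 2):
        super().__init__()
        # A transposed convolution doubles the spatial resolution of the token grid.
        self.deconv = nn.ConvTranspose2d(dim, dim, kernel_size=scale, stride=scale)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor, grid_hw: tuple) -> torch.Tensor:
        # tokens: (B, N, C) with N = H * W visual tokens.
        b, n, c = tokens.shape
        h, w = grid_hw
        x = tokens.transpose(1, 2).reshape(b, c, h, w)   # (B, C, H, W)
        x = self.deconv(x)                               # (B, C, 2H, 2W)
        x = x.flatten(2).transpose(1, 2)                 # (B, 4N, C)
        return self.norm(x)


def early_fuse_visual_prompt(vision_tokens: torch.Tensor,
                             prompt_mask: torch.Tensor,
                             prompt_embed: torch.Tensor) -> torch.Tensor:
    """Add a learned prompt embedding to the vision tokens covered by a binary mask,
    so the single transformer sees the visual prompt before any attention layer."""
    # vision_tokens: (B, N, C), prompt_mask: (B, N) in {0, 1}, prompt_embed: (C,)
    return vision_tokens + prompt_mask.unsqueeze(-1) * prompt_embed


if __name__ == "__main__":
    b, h, w, c = 1, 16, 16, 256
    tokens = torch.randn(b, h * w, c)
    mask = (torch.rand(b, h * w) > 0.9).float()          # toy visual-prompt region
    fused = early_fuse_visual_prompt(tokens, mask, torch.randn(c))
    upsampled = LearnableUpsampler(c)(fused, (h, w))
    print(upsampled.shape)                                # torch.Size([1, 1024, 256])
```

Because the prompt embedding is added directly to the selected vision tokens before any transformer layer, the prompt information participates in every subsequent attention step, which is the "early fusion" benefit the abstract refers to.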
