Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding
April 14, 2025
Authors: Tao Zhang, Xiangtai Li, Zilong Huang, Yanwei Li, Weixian Lei, Xueqing Deng, Shihao Chen, Shunping Ji, Jiashi Feng
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) achieve remarkable performance on
fine-grained pixel-level understanding tasks. However, existing works rely
heavily on extra components, such as a vision encoder (CLIP) and segmentation
experts, leading to high system complexity and limiting model scaling. In this
work, our goal is to explore a highly simplified MLLM without introducing extra
components. Our work is motivated by recent work on the Single trAnsformer as
a unified vIsion-Language Model (SAIL) design, which jointly learns vision
tokens and text tokens within one transformer. We present Pixel-SAIL, a single
transformer for pixel-wise MLLM tasks. In particular, we introduce three
technical improvements over the plain baseline. First, we design a learnable
upsampling module to refine visual token features. Second, we propose a novel
visual prompt injection strategy that enables the single transformer to understand
visual prompt inputs and benefit from the early fusion of visual prompt
embeddings and vision tokens. Third, we introduce a vision expert
distillation strategy to efficiently enhance the single transformer's
fine-grained feature extraction capability. In addition, we have collected a
comprehensive pixel understanding benchmark (PerBench) with manual verification.
It includes three tasks: detailed object description, visual prompt-based
question answering, and visual-text referring segmentation. Extensive
experiments on four referring segmentation benchmarks, one visual prompt
benchmark, and our PerBench show that Pixel-SAIL achieves comparable or
even better results with a much simpler pipeline. Code and model will be
released at https://github.com/magic-research/Sa2VA.
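
The abstract names two architectural ideas: a learnable upsampling module that refines vision-token features, and a visual prompt injection strategy based on early fusion of prompt embeddings with vision tokens. The PyTorch sketch below is a minimal illustration of these two ideas only, under assumed shapes and layer choices; the names `LearnableUpsampler` and `inject_visual_prompt` are hypothetical and are not taken from the Pixel-SAIL implementation.

```python
# Minimal sketch (not the authors' code): a learnable upsampler over vision
# tokens and early fusion of a visual-prompt embedding into the token sequence.
import torch
import torch.nn as nn


class LearnableUpsampler(nn.Module):
    """Refine coarse vision-token features to a higher spatial resolution."""

    def __init__(self, dim: int, scale: int = 2):
        super().__init__()
        # Transposed convolution enlarges the token grid; a linear layer refines features.
        self.up = nn.ConvTranspose2d(dim, dim, kernel_size=scale, stride=scale)
        self.refine = nn.Linear(dim, dim)

    def forward(self, tokens: torch.Tensor, grid: int) -> torch.Tensor:
        # tokens: (B, grid*grid, dim) -> (B, (scale*grid)**2, dim)
        b, n, d = tokens.shape
        x = tokens.transpose(1, 2).reshape(b, d, grid, grid)
        x = self.up(x)
        x = x.flatten(2).transpose(1, 2)
        return self.refine(x)


def inject_visual_prompt(vision_tokens: torch.Tensor,
                         prompt_mask: torch.Tensor,
                         prompt_embed: torch.Tensor) -> torch.Tensor:
    """Early fusion: add a learned prompt embedding to the vision tokens that
    fall inside a user-provided region (e.g. a box or scribble mask)."""
    # vision_tokens: (B, N, D); prompt_mask: (B, N) in {0, 1}; prompt_embed: (D,)
    return vision_tokens + prompt_mask.unsqueeze(-1) * prompt_embed


if __name__ == "__main__":
    B, grid, D = 1, 16, 256
    tokens = torch.randn(B, grid * grid, D)            # coarse vision tokens
    mask = (torch.rand(B, grid * grid) > 0.8).float()  # toy visual-prompt region
    prompt = nn.Parameter(torch.randn(D))              # learned prompt embedding
    fused = inject_visual_prompt(tokens, mask, prompt)
    upsampled = LearnableUpsampler(D)(fused, grid)
    print(upsampled.shape)  # torch.Size([1, 1024, 256])
```

In the paper's setting, the fused tokens would be processed by the single unified transformer itself; this sketch only shows the token-level operations described in the abstract.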