
Introducing Visual Perception Token into Multimodal Large Language Model

February 24, 2025
Authors: Runpeng Yu, Xinyin Ma, Xinchao Wang
cs.AI

Abstract

To utilize visual information, a Multimodal Large Language Model (MLLM) relies on the perception process of its vision encoder. The completeness and accuracy of visual perception significantly influence the precision of spatial reasoning, fine-grained understanding, and other tasks. However, MLLMs still lack the autonomous capability to control their own visual perception processes, for example, selectively reviewing specific regions of an image or focusing on information related to specific object categories. In this work, we propose the concept of the Visual Perception Token, aiming to empower MLLMs with a mechanism to control their visual perception processes. We design two types of Visual Perception Tokens, termed the Region Selection Token and the Vision Re-Encoding Token. MLLMs autonomously generate these tokens, just as they generate text, and use them to trigger additional visual perception actions. The Region Selection Token explicitly identifies specific regions in an image that require further perception, while the Vision Re-Encoding Token uses its hidden states as control signals to guide additional visual perception processes. Extensive experiments demonstrate the advantages of these tokens in handling spatial reasoning, improving fine-grained understanding, and other tasks. On average, the introduction of Visual Perception Tokens improves the performance of a 2B model by 23.6%, increasing its score from 0.572 to 0.708, and it even outperforms a 7B parameter model by 13.4% (from 0.624). Please check out our repo: https://github.com/yu-rp/VisualPerceptionToken
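The Region Selection Token mechanism described above can be sketched in a few lines: the model emits a special token followed by region coordinates, and an outer loop intercepts it to crop and re-encode that region for a second perception pass. The token string, the coordinate format, and the helper functions below are hypothetical illustrations, not the paper's actual implementation:

```python
# Minimal sketch of a Region Selection Token driving a second perception
# pass. The "<region>" marker and normalized-box payload are assumptions
# for illustration; see the paper's repo for the real token design.

REGION_TOKEN = "<region>"  # hypothetical special-token marker

def parse_region_token(generated_text):
    """Extract a normalized bounding box (x0, y0, x1, y1) emitted after
    the marker, e.g. '<region>0.5,0.5,1.0,1.0'. Returns None if the
    model did not request further perception."""
    if REGION_TOKEN not in generated_text:
        return None
    payload = generated_text.split(REGION_TOKEN, 1)[1]
    return tuple(float(v) for v in payload.split(",")[:4])

def to_pixel_box(image_size, box):
    """Map a normalized box to pixel coordinates for cropping."""
    w, h = image_size
    x0, y0, x1, y1 = box
    return (int(x0 * w), int(y0 * h), int(x1 * w), int(y1 * h))

# Example: the model asks to re-inspect the lower-right quadrant.
out = "The sign is hard to read.<region>0.5,0.5,1.0,1.0"
box = parse_region_token(out)
if box is not None:
    pixel_box = to_pixel_box((640, 480), box)
    # In a full system, this crop would be passed back through the
    # vision encoder and appended to the context for another pass.
```

In the paper's framing the model generates such tokens autoregressively, just like text, so the controller loop above would sit inside the generation loop rather than after it.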
