
Introducing Visual Perception Token into Multimodal Large Language Model

February 24, 2025
Authors: Runpeng Yu, Xinyin Ma, Xinchao Wang
cs.AI

Abstract

To utilize visual information, a Multimodal Large Language Model (MLLM) relies on the perception process of its vision encoder. The completeness and accuracy of visual perception significantly influence the precision of spatial reasoning, fine-grained understanding, and other tasks. However, MLLMs still lack the autonomous capability to control their own visual perception processes, for example by selectively reviewing specific regions of an image or focusing on information related to specific object categories. In this work, we propose the concept of the Visual Perception Token, aiming to empower MLLMs with a mechanism to control their visual perception processes. We design two types of Visual Perception Tokens, termed the Region Selection Token and the Vision Re-Encoding Token. MLLMs autonomously generate these tokens, just as they generate text, and use them to trigger additional visual perception actions. The Region Selection Token explicitly identifies specific regions of an image that require further perception, while the Vision Re-Encoding Token uses its hidden states as control signals to guide additional visual perception processes. Extensive experiments demonstrate the advantages of these tokens in handling spatial reasoning, improving fine-grained understanding, and other tasks. On average, introducing Visual Perception Tokens improves the performance of a 2B model by 23.6%, increasing its score from 0.572 to 0.708, and even allows it to outperform a 7B-parameter model (score 0.624) by 13.4%. Please check out our repo at https://github.com/yu-rp/VisualPerceptionToken.

Summary

AI-Generated Summary

Paper Overview

Core Contribution

  • Introduces Visual Perception Tokens (VPTs) to enable Multimodal Large Language Models (MLLMs) to autonomously control their visual perception processes.
  • Proposes two types of VPTs: Region Selection Token and Vision Re-Encoding Token.
  • Demonstrates significant performance improvements in tasks like spatial reasoning, fine-grained understanding, and VQA.

Research Context

  • MLLMs rely on vision encoders for visual perception, but lack autonomous control over perception processes.
  • Prior approaches depend on manually designed pipelines for image annotations or feature enhancement.
  • This work explores enabling MLLMs to autonomously control visual perception using specialized tokens.

Keywords

  • Multimodal Large Language Models (MLLMs)
  • Visual Perception Tokens (VPTs)
  • Region Selection Token
  • Vision Re-Encoding Token
  • Spatial Reasoning
  • Fine-Grained Understanding
  • Visual Question Answering (VQA)

Background

Research Gap

  • MLLMs lack the ability to autonomously control visual perception processes, such as selectively reviewing specific regions or focusing on object categories.
  • Existing methods rely on manual pipelines, limiting the model's ability to adapt dynamically to visual inputs.

Technical Challenges

  • Designing tokens that can trigger and control visual perception processes without disrupting the next-token prediction paradigm of LLMs (a minimal vocabulary-registration sketch follows this list).
  • Ensuring compatibility between visual perception tokens and the existing MLLM architecture.
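
As a concrete illustration of the first point, control tokens can be registered as ordinary vocabulary entries so that they are emitted through standard next-token prediction. The sketch below is minimal and illustrative, not the authors' code: the stand-in text-only checkpoint and the token strings `<region_select>` and `<vision_reencode>` are assumptions.

```python
# A minimal sketch (not the authors' code) of registering control tokens as vocabulary
# entries, so they are produced by ordinary next-token prediction.
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_NAME = "Qwen/Qwen2-0.5B"  # stand-in text-only checkpoint; the paper builds on Qwen2-VL

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Hypothetical token strings for the two Visual Perception Token types.
special_tokens = {"additional_special_tokens": ["<region_select>", "<vision_reencode>"]}
tokenizer.add_special_tokens(special_tokens)

# Extend the embedding (and tied output) matrix so the new tokens get trainable rows.
model.resize_token_embeddings(len(tokenizer))
```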

Prior Approaches

  • Visual Prompting: Uses manual annotations like points and masks to control segmentation tasks.
  • Function-Calling/Tool-Use: The LLM's text outputs serve as arguments for downstream functions or tools, so the control signals remain confined to the natural-language space.
  • Crop and Re-Input: MLLMs output bounding boxes to crop and re-input images, but this approach struggles with precise coordinate alignment.
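
For reference, a minimal sketch of the crop-and-re-input idea is shown below; the normalized-coordinate convention is an assumption, not the exact baseline setup used in prior work.

```python
# Illustrative sketch of the "Crop and Re-Input" baseline: the model emits a bounding
# box as text, the image is cropped accordingly, and the crop is fed back for a second pass.
from PIL import Image

def crop_from_bbox(image_path: str, bbox: tuple[float, float, float, float]) -> Image.Image:
    """Crop a region given normalized (x_min, y_min, x_max, y_max) coordinates in [0, 1]."""
    image = Image.open(image_path)
    w, h = image.size
    x0, y0, x1, y1 = bbox
    return image.crop((int(x0 * w), int(y0 * h), int(x1 * w), int(y1 * h)))

# Example: a box predicted as text, e.g. "[0.25, 0.40, 0.60, 0.85]", parsed upstream.
# cropped = crop_from_bbox("example.jpg", (0.25, 0.40, 0.60, 0.85))
```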

Methodology

Technical Architecture

  • Region Selection Token: Crops and re-encodes specific regions of the image based on the token's output.
  • Vision Re-Encoding Token: Triggers additional vision encoders (e.g., DINO, SAM) to re-encode the image, with the hidden state of the token controlling the final embeddings.
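
A minimal PyTorch sketch of the second mechanism is given below. It is not the paper's implementation; the sigmoid-gating scheme and the example dimensions are assumptions chosen to illustrate how a token's hidden state can steer re-encoded features.

```python
# Sketch: re-encoded vision features (e.g., from DINOv2 or SAM) are projected into the
# LLM embedding space, and the hidden state of the Vision Re-Encoding Token modulates
# that projection.
import torch
import torch.nn as nn

class ControlledProjector(nn.Module):
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)   # maps vision features to LLM space
        self.control = nn.Linear(llm_dim, llm_dim)   # maps the token's hidden state to a gate

    def forward(self, vision_feats: torch.Tensor, token_hidden: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim) from the additional encoder
        # token_hidden: (batch, llm_dim) hidden state of the Vision Re-Encoding Token
        projected = self.proj(vision_feats)               # (batch, num_patches, llm_dim)
        gate = torch.sigmoid(self.control(token_hidden))  # (batch, llm_dim)
        return projected * gate.unsqueeze(1)              # broadcast the gate over patches

# Example shapes (assumed): DINOv2 ViT-L features (dim 1024) projected into a 1536-dim LLM space.
# projector = ControlledProjector(vision_dim=1024, llm_dim=1536)
```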

Implementation Details

  • Region Selection Token: Divides the image into a grid (e.g., 8x8) and uses cell indices to describe regions (see the helper sketch after this list).
  • Vision Re-Encoding Token: Uses a projector to align re-encoded vision features with LLM embeddings, controlled by the hidden state of the token.
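
Following the grid scheme above, the helper below converts a pair of grid-cell indices into pixel coordinates; the row-major indexing convention is an assumption made for illustration.

```python
# Helper sketch for the Region Selection Token's grid scheme: the image is split into a
# k x k grid (k = 8 in the example above) and a region is described by the indices of
# its top-left and bottom-right cells.
def cells_to_bbox(top_left: int, bottom_right: int, width: int, height: int, k: int = 8):
    """Convert a pair of cell indices (row-major, 0 .. k*k-1) into a pixel bounding box."""
    r0, c0 = divmod(top_left, k)
    r1, c1 = divmod(bottom_right, k)
    cell_w, cell_h = width / k, height / k
    return (int(c0 * cell_w), int(r0 * cell_h),
            int((c1 + 1) * cell_w), int((r1 + 1) * cell_h))

# Example: on a 1024x768 image, cells 18 (row 2, col 2) to 36 (row 4, col 4)
# map to the pixel box (256, 192, 640, 480).
```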

Innovation Points

  • Autonomous Control: MLLMs generate VPTs autonomously, similar to generating text, to control visual perception.
  • Fine-Grained Control: The hidden state of the Vision Re-Encoding Token allows for nuanced control over the perception process.
  • Iterative Perception: MLLMs can conduct multiple rounds of visual perception based on feedback from the tokens.
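
The control flow of such an iterative loop could look roughly like the sketch below. It is schematic, not the authors' implementation: `run_region_selection` and `run_vision_reencoding` are hypothetical helpers, and the token strings match the illustrative ones used earlier.

```python
# Schematic loop: generate until a perception token appears, run the corresponding
# perception action, extend the context, and resume generation.
def generate_with_perception(model, tokenizer, inputs, max_rounds: int = 3):
    for _ in range(max_rounds):
        output = model.generate(**inputs, max_new_tokens=256)
        text = tokenizer.decode(output[0], skip_special_tokens=False)

        if "<region_select>" in text:
            # Crop the selected region, re-encode it, and append it to the context.
            inputs = run_region_selection(inputs, text)     # hypothetical helper
        elif "<vision_reencode>" in text:
            # Re-encode the image with the extra encoder, conditioned on the token's hidden state.
            inputs = run_vision_reencoding(inputs, output)  # hypothetical helper
        else:
            return text  # no further perception requested
    return text
```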

Results

Experimental Setup

  • Datasets: Evaluated on tasks like General VQA, Fine-Grained VQA, Spatial Reasoning, and Text/OCR-Related VQA.
  • Models: Qwen2-VL-2B and Qwen2-VL-7B models, with DINOv2 or SAM as additional vision encoders.
  • Evaluation Metrics: GPT-4o used to evaluate alignment between model responses and ground truth.
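
An LLM-as-judge call of this kind might look like the hedged sketch below; the judging prompt and the binary scoring scale are assumptions, not the paper's exact protocol.

```python
# Sketch of a GPT-4o judging call that compares a model response against the ground truth.
from openai import OpenAI

client = OpenAI()

def judge(question: str, ground_truth: str, response: str) -> str:
    prompt = (
        "You are grading a VQA answer.\n"
        f"Question: {question}\nGround truth: {ground_truth}\nModel answer: {response}\n"
        "Reply with 1 if the answer matches the ground truth, otherwise 0."
    )
    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return result.choices[0].message.content
```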

Key Findings

  • Performance Improvement: On average, VPTs raise the 2B model's score from 0.572 to 0.708 (+23.6%), allowing it to outperform the 7B model without VPTs (0.624) by 13.4%.
  • Task-Specific Gains: Significant improvements in Spatial Reasoning (34.6%) and Fine-Grained VQA (32.7%) tasks.
  • Zero-Shot Generalization: VPTs remained effective in zero-shot settings, outperforming or matching the 7B model on unseen datasets.

Limitations

  • Granularity Trade-off: Region Selection Tokens require careful tuning of the grid granularity k (the number of cells per side) for optimal performance.
  • Over-Parameterization: Increasing the number of Vision Re-Encoding Tokens can lead to overfitting in the projector.
  • Task-Specific Effectiveness: VPTs showed limited gains in some General VQA and Text/OCR-Related VQA tasks.

Conclusion

  • Visual Perception Tokens empower MLLMs to autonomously control their visual perception processes, significantly improving performance in tasks like spatial reasoning and fine-grained understanding.
  • The Region Selection Token and Vision Re-Encoding Token provide mechanisms for iterative and fine-grained visual perception, enhancing the model's ability to handle complex visual inputs.
  • Future work could explore extending VPTs to other visual prompting techniques and encoder models, as well as integrating them into LLM-agent or LLM-tool systems.
