Introducing Visual Perception Token into Multimodal Large Language Model
Abstract
To utilize visual information, a Multimodal Large Language Model (MLLM) relies on the perception process of its vision encoder. The completeness and accuracy of visual perception significantly influence the precision of spatial reasoning, fine-grained understanding, and other tasks. However, MLLMs still lack the autonomous capability to control their own visual perception processes, for example, selectively reviewing specific regions of an image or focusing on information related to specific object categories. In this work, we propose the concept of the Visual Perception Token, aiming to empower MLLMs with a mechanism to control their visual perception processes. We design two types of Visual Perception Tokens, termed the Region Selection Token and the Vision Re-Encoding Token. MLLMs autonomously generate these tokens, just as they generate text, and use them to trigger additional visual perception actions. The Region Selection Token explicitly identifies specific regions in an image that require further perception, while the Vision Re-Encoding Token uses its hidden states as control signals to guide additional visual perception processes. Extensive experiments demonstrate the advantages of these tokens in handling spatial reasoning, improving fine-grained understanding, and other tasks. On average, the introduction of Visual Perception Tokens improves the performance of a 2B model by 23.6%, increasing its score from 0.572 to 0.708, and even outperforms a 7B-parameter model by 13.4% (from 0.624). Please check out our repo: https://github.com/yu-rp/VisualPerceptionToken.
Summary
Paper Overview
Core Contribution
- Introduces Visual Perception Tokens (VPTs) to enable Multimodal Large Language Models (MLLMs) to autonomously control their visual perception processes.
- Proposes two types of VPTs: Region Selection Token and Vision Re-Encoding Token.
- Demonstrates significant performance improvements in tasks like spatial reasoning, fine-grained understanding, and VQA.
Research Context
- MLLMs rely on vision encoders for visual perception, but lack autonomous control over perception processes.
- Prior approaches depend on manually designed pipelines for image annotations or feature enhancement.
- This work explores enabling MLLMs to autonomously control visual perception using specialized tokens.
Keywords
- Multimodal Large Language Models (MLLMs)
- Visual Perception Tokens (VPTs)
- Region Selection Token
- Vision Re-Encoding Token
- Spatial Reasoning
- Fine-Grained Understanding
- Visual Question Answering (VQA)
Background
Research Gap
- MLLMs lack the ability to autonomously control visual perception processes, such as selectively reviewing specific regions or focusing on object categories.
- Existing methods rely on manual pipelines, limiting the model's ability to adapt dynamically to visual inputs.
Technical Challenges
- Designing tokens that can trigger and control visual perception processes without disrupting the next-token prediction paradigm of LLMs (token registration sketched after this list).
- Ensuring compatibility between visual perception tokens and the existing MLLM architecture.
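A minimal sketch of how such tokens can be registered so that they are generated through ordinary next-token prediction. The token strings and the checkpoint below are illustrative assumptions, not the authors' exact setup (the paper builds on Qwen2-VL).

```python
# Minimal sketch: register perception tokens as ordinary vocabulary entries so
# the LLM can emit them via standard next-token prediction. Token strings and
# the checkpoint are illustrative assumptions, not the paper's exact setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")

# Hypothetical token strings for the two Visual Perception Token types.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<region_select>", "<vision_reencode>"]}
)

# Grow the embedding and output layers so the new tokens can be predicted;
# their embeddings are then learned during instruction tuning.
model.resize_token_embeddings(len(tokenizer))
```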
Prior Approaches
- Visual Prompting: Uses manual annotations like points and masks to control segmentation tasks.
- Function-Calling/Tool-Use: MLLMs use LLM outputs as arguments for subsequent functions or tools, but these are confined to the natural language space.
- Crop and Re-Input: MLLMs output bounding boxes that are used to crop and re-input image regions, but this approach struggles with precise coordinate alignment (sketched below).
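For contrast, a minimal sketch of the crop-and-re-input baseline, assuming the bounding box is emitted as free-form text; the brittle string parsing is where coordinate alignment tends to break. The file name and example output are placeholders, not from the paper.

```python
# Minimal sketch (assumed, not from the paper) of the crop-and-re-input
# baseline: the MLLM emits box coordinates as plain text, which are parsed
# and used to crop the image before a second query.
import re
from PIL import Image

def parse_box(text: str):
    """Parse 'x1, y1, x2, y2' coordinates from free-form model output."""
    nums = re.findall(r"\d+", text)
    if len(nums) < 4:
        return None  # parsing is brittle; the text may be malformed
    return tuple(int(n) for n in nums[:4])

image = Image.open("example.jpg")  # placeholder input image
model_output = "The answer is in region (120, 40, 360, 280)."
box = parse_box(model_output)
if box is not None:
    cropped = image.crop(box)  # `cropped` is then re-input to the MLLM
```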
Methodology
Technical Architecture
- Region Selection Token: Specifies image regions that are then cropped and re-encoded for a second round of perception.
- Vision Re-Encoding Token: Triggers an additional vision encoder (e.g., DINOv2, SAM) to re-encode the image, with the token's hidden state acting as a control signal over the final embeddings (control pathway sketched after this list).
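A minimal PyTorch sketch of the Vision Re-Encoding control pathway under one simple assumption: the token's hidden state is concatenated with the auxiliary features before projection. The paper only states that the hidden state controls the final embeddings; the exact fusion design may differ, and all module names and dimensions here are illustrative.

```python
# Minimal sketch of hidden-state-controlled re-encoding (assumed fusion design).
import torch
import torch.nn as nn

class VisionReEncodeControl(nn.Module):
    """Fuse auxiliary vision features with the hidden state of the
    Vision Re-Encoding Token before projecting them into LLM space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.projector = nn.Linear(vision_dim + llm_dim, llm_dim)

    def forward(self, aux_features: torch.Tensor, token_hidden: torch.Tensor):
        # aux_features: (num_patches, vision_dim), e.g. from DINOv2 or SAM
        # token_hidden: (llm_dim,), hidden state of the generated control token
        control = token_hidden.expand(aux_features.size(0), -1)
        return self.projector(torch.cat([aux_features, control], dim=-1))

# Example shapes only.
fuse = VisionReEncodeControl(vision_dim=1024, llm_dim=1536)
feats = fuse(torch.randn(256, 1024), torch.randn(1536))  # -> (256, 1536)
```

Concatenation is just one way to inject the control signal; cross-attention or feature-wise modulation would fit the same interface.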
Implementation Details
- Region Selection Token: Divides the image into a k x k grid (e.g., 8x8) and uses cell indices to describe the selected regions (cell-to-box mapping sketched after this list).
- Vision Re-Encoding Token: Uses a projector to align re-encoded vision features with LLM embeddings, controlled by the hidden state of the token.
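A minimal sketch of how grid-cell indices can be mapped back to a pixel bounding box, assuming row-major indexing on a k x k grid; the paper's exact index format may differ.

```python
# Minimal sketch: map Region Selection cell indices to a pixel bounding box
# (row-major indexing on a k x k grid is an assumption).
def cells_to_box(cells, image_w, image_h, k=8):
    """Convert a set of cell indices on a k x k grid to a bounding box."""
    rows = [c // k for c in cells]
    cols = [c % k for c in cells]
    cell_w, cell_h = image_w / k, image_h / k
    x1, y1 = min(cols) * cell_w, min(rows) * cell_h
    x2, y2 = (max(cols) + 1) * cell_w, (max(rows) + 1) * cell_h
    return int(x1), int(y1), int(x2), int(y2)

# e.g. cells 18, 19, 26, 27 on an 8x8 grid of a 1024x768 image
print(cells_to_box([18, 19, 26, 27], 1024, 768))  # -> (256, 192, 512, 384)
```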
Innovation Points
- Autonomous Control: MLLMs generate VPTs autonomously, similar to generating text, to control visual perception.
- Fine-Grained Control: The hidden state of the Vision Re-Encoding Token allows for nuanced control over the perception process.
- Iterative Perception: MLLMs can conduct multiple rounds of visual perception based on feedback from the tokens (control loop sketched below).
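A minimal sketch of the iterative loop, with hypothetical `generate_step` and `perceive` helpers standing in for the MLLM's decoding and the extra perception actions; it illustrates the control flow only, not the authors' implementation.

```python
# Minimal sketch of iterative perception: decoding pauses when a Visual
# Perception Token is emitted, an extra perception step runs, and decoding
# resumes with the new visual context. Helper names are hypothetical.
REGION_SELECT = "<region_select>"
VISION_REENCODE = "<vision_reencode>"

def answer_with_perception(generate_step, perceive, question, image, max_rounds=3):
    """generate_step(context) -> next chunk of model output (str)
    perceive(token, image, context) -> new visual context to append"""
    context = [image, question]
    for _ in range(max_rounds):
        output = generate_step(context)
        token = next((t for t in (REGION_SELECT, VISION_REENCODE) if t in output), None)
        if token is None:
            return output  # plain answer, no further perception needed
        context.append(perceive(token, image, context))  # add re-perceived features
    return generate_step(context)  # final answer after the round budget is spent
```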
Results
Experimental Setup
- Datasets: Evaluated on tasks like General VQA, Fine-Grained VQA, Spatial Reasoning, and Text/OCR-Related VQA.
- Models: Qwen2-VL-2B and Qwen2-VL-7B, with DINOv2 or SAM as the additional vision encoder.
- Evaluation Metrics: GPT-4o is used to judge whether model responses align with the ground truth (a minimal judging call is sketched after this list).
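A minimal sketch of the GPT-4o judging step, with an assumed prompt wording; the paper's exact judging prompt is not reproduced here.

```python
# Minimal sketch of GPT-4o-as-judge scoring (assumed prompt wording): the judge
# is asked whether the model response matches the ground truth.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def judge(question: str, ground_truth: str, response: str) -> str:
    prompt = (
        f"Question: {question}\nGround truth: {ground_truth}\n"
        f"Model response: {response}\n"
        "Does the response agree with the ground truth? Answer 1 for yes, 0 for no."
    )
    out = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return out.choices[0].message.content.strip()
```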
Key Findings
- Performance Improvement: With VPTs, the 2B model's average score rose from 0.572 to 0.708 (+23.6%), exceeding the 7B model without VPTs (0.624) by 13.4%.
- Task-Specific Gains: Significant improvements in Spatial Reasoning (34.6%) and Fine-Grained VQA (32.7%) tasks.
- Zero-Shot Generalization: VPTs remained effective in zero-shot settings, outperforming or matching the 7B model on unseen datasets.
Limitations
- Granularity Trade-off: Region Selection Tokens require careful tuning of grid granularity (k) for optimal performance.
- Over-Parameterization: Increasing the number of Vision Re-Encoding Tokens can lead to overfitting in the projector.
- Task-Specific Effectiveness: VPTs showed limited gains in some General VQA and Text/OCR-Related VQA tasks.
Conclusion
- Visual Perception Tokens empower MLLMs to autonomously control their visual perception processes, significantly improving performance in tasks like spatial reasoning and fine-grained understanding.
- The Region Selection Token and Vision Re-Encoding Token provide mechanisms for iterative and fine-grained visual perception, enhancing the model's ability to handle complex visual inputs.
- Future work could explore extending VPTs to other visual prompting techniques and encoder models, as well as integrating them into LLM-agent or LLM-tool systems.