Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
December 4, 2024
Authors: Mahtab Bigverdi, Zelun Luo, Cheng-Yu Hsieh, Ethan Shen, Dongping Chen, Linda G. Shapiro, Ranjay Krishna
cs.AI
Abstract
Multimodal language models (MLMs) still face challenges in fundamental visual
perception tasks where specialized models excel. Tasks requiring reasoning
about 3D structures benefit from depth estimation, and reasoning about 2D
object instances benefits from object detection. Yet, MLMs cannot produce
intermediate depth or boxes to reason over. Finetuning MLMs on relevant data
doesn't generalize well and outsourcing computation to specialized vision tools
is too compute-intensive and memory-inefficient. To address this, we introduce
Perception Tokens, intrinsic image representations designed to assist reasoning
tasks where language is insufficient. Perception tokens act as auxiliary
reasoning tokens, akin to chain-of-thought prompts in language models. For
example, in a depth-related task, an MLM augmented with perception tokens can
reason by generating a depth map as tokens, enabling it to solve the problem
effectively. We propose AURORA, a training method that augments MLMs with
perception tokens for improved reasoning over visual inputs. AURORA leverages a
VQVAE to transform intermediate image representations, such as depth maps, into
a tokenized format, along with bounding box tokens, which are then used in a multi-task
training framework. AURORA achieves notable improvements across counting
benchmarks: +10.8% on BLINK, +11.3% on CVBench, and +8.3% on SEED-Bench,
outperforming finetuning approaches in generalization across datasets. It also
improves on relative depth: over +6% on BLINK. With perception tokens, AURORA
expands the scope of MLMs beyond language-based reasoning, paving the way for
more effective visual reasoning capabilities.
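The core mechanism the abstract describes, quantizing an intermediate image representation such as a depth map into discrete tokens an MLM can generate, can be illustrated with a toy nearest-neighbor vector quantizer. This is a minimal sketch, not the paper's actual VQVAE: the patch size, codebook shape, and token naming below are assumptions made for illustration only.

```python
import numpy as np

def quantize_depth_map(depth, codebook, patch=4):
    """Toy vector quantization: split a depth map into non-overlapping
    patches and map each patch to the index of its nearest codebook
    vector. An illustrative stand-in for a learned VQVAE encoder."""
    H, W = depth.shape
    tokens = []
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            vec = depth[i:i + patch, j:j + patch].reshape(-1)
            # Nearest codebook entry by squared Euclidean distance.
            idx = int(np.argmin(((codebook - vec) ** 2).sum(axis=1)))
            tokens.append(idx)
    return tokens

# Toy example: an 8x8 "depth map" and a 16-entry codebook of
# 16-dim vectors (one per flattened 4x4 patch).
rng = np.random.default_rng(0)
codebook = rng.random((16, 16))
depth = rng.random((8, 8))
tokens = quantize_depth_map(depth, codebook)
# Four patches yield four discrete indices, which an MLM could emit
# as special tokens, e.g. "<DEPTH_START> <d_3> <d_7> ... <DEPTH_END>".
print(tokens)
```

In the method proper, these discrete indices would extend the model's vocabulary so that a depth map can be generated token by token as an intermediate reasoning step, analogous to a chain-of-thought trace.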