Perception Tokens Enhance Visual Reasoning in Multimodal Language Models

December 4, 2024
作者: Mahtab Bigverdi, Zelun Luo, Cheng-Yu Hsieh, Ethan Shen, Dongping Chen, Linda G. Shapiro, Ranjay Krishna
cs.AI

Abstract

Multimodal language models (MLMs) still face challenges in fundamental visual perception tasks where specialized models excel. Tasks requiring reasoning about 3D structures benefit from depth estimation, and tasks requiring reasoning about 2D object instances benefit from object detection. Yet, MLMs cannot produce intermediate depth maps or bounding boxes to reason over. Finetuning MLMs on relevant data does not generalize well, and outsourcing computation to specialized vision tools is too compute-intensive and memory-inefficient. To address this, we introduce Perception Tokens, intrinsic image representations designed to assist reasoning tasks where language is insufficient. Perception Tokens act as auxiliary reasoning tokens, akin to chain-of-thought prompts in language models. For example, in a depth-related task, an MLM augmented with Perception Tokens can reason by generating a depth map as tokens, enabling it to solve the problem effectively. We propose AURORA, a training method that augments MLMs with Perception Tokens for improved reasoning over visual inputs. AURORA leverages a VQVAE to transform intermediate image representations, such as depth maps, into a tokenized format, along with bounding box tokens, which are then used in a multi-task training framework. AURORA achieves notable improvements across counting benchmarks: +10.8% on BLINK, +11.3% on CVBench, and +8.3% on SEED-Bench, outperforming finetuning approaches in cross-dataset generalization. It also improves relative depth estimation by over +6% on BLINK. With Perception Tokens, AURORA expands the scope of MLMs beyond language-based reasoning, paving the way for more effective visual reasoning capabilities.
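The core mechanism described above, converting a dense intermediate representation such as a depth map into discrete tokens a language model can emit, can be sketched with a minimal nearest-codebook quantizer. This is an illustrative assumption, not the paper's implementation: a real VQVAE uses a learned convolutional encoder and a trained codebook, and the names `quantize_depth_map`, the patch size, and the `<DEPTH_k>` token format are all hypothetical.

```python
import numpy as np

def quantize_depth_map(depth, codebook, patch=4):
    """Split a depth map into non-overlapping patches and map each patch
    to the index of its nearest codebook vector (a VQVAE-style vector
    quantization step, here with a fixed rather than learned codebook)."""
    h, w = depth.shape
    tokens = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            vec = depth[i:i + patch, j:j + patch].reshape(-1)
            # Nearest neighbour in the codebook by Euclidean distance.
            idx = int(np.argmin(np.linalg.norm(codebook - vec, axis=1)))
            tokens.append(idx)
    return tokens

def tokens_to_text(tokens):
    """Render codebook indices as special perception-token strings that
    an MLM could generate inline, e.g. "<DEPTH_7>" (format assumed)."""
    return "".join(f"<DEPTH_{t}>" for t in tokens)

# Toy example: an 8x8 depth map and a 16-entry codebook of flattened
# 4x4 patches yields a 4-token sequence.
rng = np.random.default_rng(0)
depth_map = rng.random((8, 8))
codebook = rng.random((16, 16))  # 16 code vectors, each 4*4 = 16 values
token_ids = quantize_depth_map(depth_map, codebook)
token_text = tokens_to_text(token_ids)
```

In this sketch the depth map becomes a short token string that can be interleaved with ordinary text tokens during multi-task training, which is the role the abstract assigns to Perception Tokens.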

