Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation
April 24, 2025
Authors: Phillip Y. Lee, Jihyeon Je, Chanho Park, Mikaela Angelina Uy, Leonidas Guibas, Minhyuk Sung
cs.AI
Abstract
We present a framework for perspective-aware reasoning in vision-language
models (VLMs) through mental imagery simulation. Perspective-taking, the
ability to perceive an environment or situation from an alternative viewpoint,
is a key benchmark for human-level visual understanding, essential for
environmental interaction and collaboration with autonomous agents. Despite
advancements in spatial reasoning within VLMs, recent research has shown that
modern VLMs significantly lack perspective-aware reasoning capabilities and
exhibit a strong bias toward egocentric interpretations. To bridge the gap
between VLMs and human perception, we focus on the role of mental imagery,
where humans perceive the world through abstracted representations that
facilitate perspective shifts. Motivated by this, we propose a framework for
perspective-aware reasoning, named Abstract Perspective Change (APC), that
effectively leverages vision foundation models, such as object detection,
segmentation, and orientation estimation, to construct scene abstractions and
enable perspective transformations. Our experiments on synthetic and real-image
benchmarks, compared with various VLMs, demonstrate significant improvements in
perspective-aware reasoning with our framework, further outperforming
fine-tuned spatial reasoning models and novel-view-synthesis-based approaches.