Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation
April 24, 2025
Authors: Phillip Y. Lee, Jihyeon Je, Chanho Park, Mikaela Angelina Uy, Leonidas Guibas, Minhyuk Sung
cs.AI
Abstract
We present a framework for perspective-aware reasoning in vision-language
models (VLMs) through mental imagery simulation. Perspective-taking, the
ability to perceive an environment or situation from an alternative viewpoint,
is a key benchmark for human-level visual understanding, essential for
environmental interaction and collaboration with autonomous agents. Despite
advancements in spatial reasoning within VLMs, recent research has shown that
modern VLMs significantly lack perspective-aware reasoning capabilities and
exhibit a strong bias toward egocentric interpretations. To bridge the gap
between VLMs and human perception, we focus on the role of mental imagery,
where humans perceive the world through abstracted representations that
facilitate perspective shifts. Motivated by this, we propose a framework for
perspective-aware reasoning, named Abstract Perspective Change (APC), that
effectively leverages vision foundation models, such as object detection,
segmentation, and orientation estimation, to construct scene abstractions and
enable perspective transformations. Our experiments on synthetic and real-image
benchmarks, compared with various VLMs, demonstrate significant improvements in
perspective-aware reasoning with our framework, further outperforming
fine-tuned spatial reasoning models and novel-view-synthesis-based approaches.