
CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting

April 21, 2025
作者: Atin Pothiraj, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal
cs.AI

Abstract

Recognizing and reasoning about occluded (partially or fully hidden) objects is vital to understanding visual scenes, as occlusions frequently occur in real-world environments and act as obstacles for spatial comprehension. To test models' ability to reason about multiple occluded objects, we introduce a novel task, Counting Amodally for Patterns Through Unseen REgions (CAPTURe), which requires a model to count objects arranged in a pattern by inferring how the pattern continues behind an occluder (an object that blocks parts of the scene). CAPTURe requires both recognizing visual patterns and reasoning, making it a useful testbed for evaluating whether vision-language models (VLMs) understand occluded patterns and possess spatial understanding skills. By requiring models to reason about occluded objects, CAPTURe also tests VLMs' ability to form world models that would allow them to fill in missing information. CAPTURe consists of two parts: (1) CAPTURe-real, with manually filtered images of real objects arranged in patterns, and (2) CAPTURe-synthetic, a controlled diagnostic with generated patterned images. We evaluate four strong VLMs (GPT-4o, Intern-VL2, Molmo, and Qwen2-VL) on CAPTURe, finding that models struggle to count objects in both occluded and unoccluded patterns. Crucially, we find that models perform worse with occlusion, suggesting that VLMs are also deficient in inferring unseen spatial relationships: even the strongest VLMs, such as GPT-4o, fail to count accurately under occlusion. In contrast, humans achieve very low error on CAPTURe. We also find that providing auxiliary information about occluded object locations increases performance, underscoring that model error comes both from an inability to handle occlusion and from difficulty counting in images.
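
To make the CAPTURe-synthetic setting concrete, the following is a minimal, hypothetical sketch (not the authors' actual generation pipeline) of how a patterned image with an occluder and its ground-truth count could be produced: a regular grid of dots is drawn, a gray rectangle hides part of the grid, and the target answer is the full count of dots, including the hidden ones. The function name `make_occluded_grid` and all parameters are illustrative assumptions.

```python
# Hypothetical sketch of a CAPTURe-synthetic-style example (assumed setup, not the paper's code).
from PIL import Image, ImageDraw

def make_occluded_grid(rows=4, cols=6, spacing=60, radius=10,
                       occluder_box=(150, 60, 330, 200), size=(420, 300)):
    """Draw a rows x cols grid of dots, then cover part of it with a gray rectangle.

    Returns the image, the total number of dots in the pattern, and how many
    dots end up hidden behind the occluder.
    """
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    x0, y0, x1, y1 = occluder_box
    hidden = 0
    for r in range(rows):
        for c in range(cols):
            x, y = 40 + c * spacing, 40 + r * spacing
            draw.ellipse((x - radius, y - radius, x + radius, y + radius), fill="black")
            if x0 <= x <= x1 and y0 <= y <= y1:
                hidden += 1
    # Draw the occluder last so it covers the dots beneath it.
    draw.rectangle(occluder_box, fill="gray")
    return img, rows * cols, hidden

img, total, hidden = make_occluded_grid()
# A model is shown `img` and asked for the total count (24 here),
# even though `hidden` of the dots are not visible.
print(f"ground-truth count = {total}, dots hidden by occluder = {hidden}")
img.save("capture_synthetic_example.png")
```

Because the dots lie on a regular grid, the number of hidden dots is fully determined by the visible rows and columns, which is precisely the amodal inference the task asks a model to perform.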
