CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting

April 21, 2025
Authors: Atin Pothiraj, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal
cs.AI

Abstract

Recognizing and reasoning about occluded (partially or fully hidden) objects is vital to understanding visual scenes, as occlusions frequently occur in real-world environments and act as obstacles for spatial comprehension. To test models' ability to reason about multiple occluded objects, we introduce a novel task, Counting Amodally for Patterns Through Unseen REgions (CAPTURe), which requires a model to count objects arranged in a pattern by inferring how the pattern continues behind an occluder (an object which blocks parts of the scene). CAPTURe requires both recognizing visual patterns and reasoning, making it a useful testbed for evaluating vision-language models (VLMs) on whether they understand occluded patterns and possess spatial understanding skills. By requiring models to reason about occluded objects, CAPTURe also tests VLMs' ability to form world models that would allow them to fill in missing information. CAPTURe consists of two parts: (1) CAPTURe-real, with manually filtered images of real objects in patterns, and (2) CAPTURe-synthetic, a controlled diagnostic with generated patterned images. We evaluate four strong VLMs (GPT-4o, Intern-VL2, Molmo, and Qwen2-VL) on CAPTURe, finding that models struggle to count on both occluded and unoccluded patterns. Crucially, we find that models perform worse with occlusion, suggesting that VLMs are also deficient in inferring unseen spatial relationships: even the strongest VLMs like GPT-4o fail to count with occlusion. In contrast, we find that humans achieve very little error on CAPTURe. We also find that providing auxiliary information of occluded object locations increases performance, underscoring that the model error comes both from an inability to handle occlusion and from difficulty counting in images.
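The synthetic half of the benchmark lends itself to a simple illustration. Below is a minimal sketch, not the authors' actual data generator, of how a CAPTURe-synthetic-style instance could be constructed: objects placed on a regular grid pattern, an opaque occluder covering part of the pattern, and a ground-truth count that the model must infer amodally. The grid dimensions, spacing, and occluder box used here are illustrative assumptions, not values from the paper.

```python
# Minimal sketch (assumed setup, not the paper's generator) of an occluded
# pattern-counting instance in the spirit of CAPTURe-synthetic.
from PIL import Image, ImageDraw

def make_occluded_grid(rows=4, cols=6, spacing=60, radius=12,
                       occluder_box=(190, 70, 330, 200)):
    """Draw a rows x cols grid of dots and cover part of it with a gray box.

    Returns the image, the total object count (the answer the model must
    infer by continuing the pattern), and the number of dots still visible.
    """
    width, height = cols * spacing + spacing, rows * spacing + spacing
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)

    # Lay the objects out on a regular grid pattern.
    centers = [(spacing + c * spacing, spacing + r * spacing)
               for r in range(rows) for c in range(cols)]
    for x, y in centers:
        draw.ellipse((x - radius, y - radius, x + radius, y + radius), fill="red")

    # Occluder: an opaque rectangle that hides part of the pattern.
    draw.rectangle(occluder_box, fill="gray")

    # Approximation: a dot counts as hidden if its center lies inside the box.
    x0, y0, x1, y1 = occluder_box
    visible = sum(1 for x, y in centers
                  if not (x0 <= x <= x1 and y0 <= y <= y1))
    return img, len(centers), visible

img, total, visible = make_occluded_grid()
print(f"ground-truth count: {total}, visible after occlusion: {visible}")
img.save("capture_synthetic_example.png")
```

A model that only counts what it can see would answer with the visible count rather than the full total; the gap between the two is exactly what the occluded condition of the task is designed to expose.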
