3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark
December 10, 2024
Authors: Wufei Ma, Haoyu Chen, Guofeng Zhang, Celso M de Melo, Alan Yuille, Jieneng Chen
cs.AI
Abstract
3D spatial reasoning is the ability to analyze and interpret the positions,
orientations, and spatial relationships of objects in 3D space. This
allows models to develop a comprehensive understanding of the 3D scene,
broadening their applicability to areas such as autonomous
navigation, robotics, and AR/VR. While large multi-modal models (LMMs) have
achieved remarkable progress in a wide range of image and video understanding
tasks, their capabilities to perform 3D spatial reasoning on diverse natural
images are less studied. In this work, we present the first comprehensive 3D
spatial reasoning benchmark, 3DSRBench, with 2,772 manually annotated visual
question-answer pairs across 12 question types. We conduct robust and thorough
evaluation of 3D spatial reasoning capabilities by balancing the data
distribution and adopting a novel FlipEval strategy. To further study the
robustness of 3D spatial reasoning w.r.t. camera 3D viewpoints, our 3DSRBench
includes two subsets with 3D spatial reasoning questions on paired images with
common and uncommon viewpoints. We benchmark a wide range of open-source and
proprietary LMMs, uncovering their limitations in various aspects of 3D
awareness, such as height, orientation, location, and multi-object reasoning,
as well as their degraded performance on images with uncommon camera
viewpoints. Our 3DSRBench provides valuable findings and insights about the
future development of LMMs with strong 3D reasoning capabilities. Our project
page and dataset are available at https://3dsrbench.github.io.