3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark
December 10, 2024
Authors: Wufei Ma, Haoyu Chen, Guofeng Zhang, Celso M de Melo, Alan Yuille, Jieneng Chen
cs.AI
Abstract
3D spatial reasoning is the ability to analyze and interpret the positions,
orientations, and spatial relationships of objects within the 3D space. This
allows models to develop a comprehensive understanding of the 3D scene,
enabling their applicability to a broader range of areas, such as autonomous
navigation, robotics, and AR/VR. While large multi-modal models (LMMs) have
achieved remarkable progress in a wide range of image and video understanding
tasks, their capabilities to perform 3D spatial reasoning on diverse natural
images are less studied. In this work, we present the first comprehensive 3D
spatial reasoning benchmark, 3DSRBench, with 2,772 manually annotated visual
question-answer pairs across 12 question types. We conduct a robust and thorough
evaluation of 3D spatial reasoning capabilities by balancing the data
distribution and adopting a novel FlipEval strategy. To further study the
robustness of 3D spatial reasoning w.r.t. 3D camera viewpoints, our 3DSRBench
includes two subsets with 3D spatial reasoning questions on paired images with
common and uncommon viewpoints. We benchmark a wide range of open-source and
proprietary LMMs, uncovering their limitations in various aspects of 3D
awareness, such as height, orientation, location, and multi-object reasoning,
as well as their degraded performance on images with uncommon camera
viewpoints. Our 3DSRBench provides valuable findings and insights about the
future development of LMMs with strong 3D reasoning capabilities. Our project
page and dataset are available at https://3dsrbench.github.io.