3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark

Abstract

3D spatial reasoning is the ability to analyze and interpret the positions, orientations, and spatial relationships of objects in 3D space. This allows models to develop a comprehensive understanding of the 3D scene, enabling their application to a broader range of areas, such as autonomous navigation, robotics, and AR/VR. While large multi-modal models (LMMs) have achieved remarkable progress in a wide range of image and video understanding tasks, their ability to perform 3D spatial reasoning on diverse natural images is less studied. In this work, we present the first comprehensive 3D spatial reasoning benchmark, 3DSRBench, with 2,772 manually annotated visual question-answer pairs across 12 question types. We conduct a robust and thorough evaluation of 3D spatial reasoning capabilities by balancing the data distribution and adopting a novel FlipEval strategy. To further study the robustness of 3D spatial reasoning with respect to camera 3D viewpoints, 3DSRBench includes two subsets with 3D spatial reasoning questions on paired images with common and uncommon viewpoints. We benchmark a wide range of open-source and proprietary LMMs, uncovering their limitations in various aspects of 3D awareness, such as height, orientation, location, and multi-object reasoning, as well as their degraded performance on images with uncommon camera viewpoints. 3DSRBench provides valuable findings and insights for the future development of LMMs with strong 3D reasoning capabilities. Our project page and dataset are available at https://3dsrbench.github.io.
