Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs
April 21, 2025
Authors: Chun-Hsiao Yeh, Chenyu Wang, Shengbang Tong, Ta-Ying Cheng, Rouyu Wang, Tianzhe Chu, Yuexiang Zhai, Yubei Chen, Shenghua Gao, Yi Ma
cs.AI
Abstract
Multi-view understanding, the ability to reconcile visual information across diverse viewpoints for effective navigation, manipulation, and 3D scene comprehension, is a fundamental challenge for Multi-Modal Large Language Models (MLLMs) to be used as embodied agents. While recent MLLMs have shown impressive advances in high-level reasoning and planning, they frequently fall short when confronted with multi-view geometric consistency and cross-view correspondence. To comprehensively evaluate the challenges of MLLMs in multi-view scene reasoning, we propose All-Angles Bench, a benchmark of over 2,100 carefully human-annotated multi-view question-answer pairs across 90 diverse real-world scenes. Our six tasks (counting, attribute identification, relative distance, relative direction, object manipulation, and camera pose estimation) specifically test a model's geometric correspondence and its capacity to align information consistently across views. Our extensive experiments, benchmarking 27 representative MLLMs including Gemini-2.0-Flash, Claude-3.7-Sonnet, and GPT-4o against human evaluators, reveal a substantial performance gap, indicating that current MLLMs remain far from human-level proficiency. Through in-depth analysis, we show that MLLMs particularly underperform in two respects: (1) cross-view correspondence for partially occluded views and (2) establishing coarse camera poses. These findings highlight the need for domain-specific refinements or modules that embed stronger multi-view awareness. We believe All-Angles Bench offers valuable insights and contributes to bridging the gap between MLLMs and human-level multi-view understanding. The project and benchmark are publicly available at https://danielchyeh.github.io/All-Angles-Bench/.