4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding
March 22, 2025
Authors: Wenxuan Zhu, Bing Li, Cheng Zheng, Jinjie Mai, Jun Chen, Letian Jiang, Abdullah Hamdi, Sara Rojas Martinez, Chia-Wen Lin, Mohamed Elhoseiny, Bernard Ghanem
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated impressive 2D
image/video understanding capabilities. However, there is no publicly available
standardized benchmark to assess the ability of MLLMs to understand
4D objects (3D objects that evolve over time). In this paper, we
introduce 4D-Bench, the first benchmark to evaluate the capabilities of MLLMs
in 4D object understanding, featuring tasks in 4D object Question Answering (4D
object QA) and 4D object captioning. 4D-Bench provides 4D objects with diverse
categories, high-quality annotations, and tasks necessitating multi-view
spatial-temporal understanding, in contrast to existing 2D image/video-based
benchmarks. With 4D-Bench, we evaluate a wide range of open-source and
closed-source MLLMs. The results from the 4D object captioning experiment
indicate that MLLMs generally exhibit weaker temporal understanding than
appearance understanding; notably, while open-source models approach
closed-source performance in appearance understanding, they lag further
behind in temporal understanding. The 4D object QA experiments yield surprising
findings: even with simple single-object videos, MLLMs perform poorly, with the
state-of-the-art GPT-4o achieving only 63% accuracy against a human
baseline of 91%. These findings highlight a substantial gap in 4D object
understanding and the need for further advancements in MLLMs.
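To make the QA evaluation protocol concrete, the sketch below shows how a multiple-choice accuracy score like the 63% reported above could be computed. This is a minimal hypothetical Python sketch, not the paper's released code: the `QAItem` structure, the `query_mllm` callable, and the leading-letter matching rule are all illustrative assumptions.

```python
# Hypothetical sketch of a multiple-choice QA evaluation loop (illustration
# only; not the 4D-Bench release code). A 4D object is assumed to be rendered
# as multi-view video frames, and each QA item pairs those frames with a
# multiple-choice question. `query_mllm` is a stand-in for whichever MLLM API
# is under evaluation (e.g., GPT-4o).

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class QAItem:
    frame_paths: List[str]   # multi-view video frames sampled over time
    question: str
    choices: List[str]       # e.g., ["A. red", "B. blue", ...]
    answer: str              # ground-truth choice letter, e.g., "B"

def evaluate_qa(items: List[QAItem],
                query_mllm: Callable[[List[str], str], str]) -> float:
    """Return the model's accuracy over the QA items."""
    correct = 0
    for item in items:
        prompt = (item.question + "\n" + "\n".join(item.choices) +
                  "\nAnswer with the letter of the correct choice.")
        prediction = query_mllm(item.frame_paths, prompt).strip()
        # Compare only the leading choice letter, tolerating trailing text.
        if prediction[:1].upper() == item.answer[:1].upper():
            correct += 1
    return correct / len(items) if items else 0.0
```

Under this kind of protocol, a model's score is simply the fraction of items whose predicted choice letter matches the ground truth, which is how the GPT-4o (63%) and human (91%) figures above can be compared directly.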