4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding
March 22, 2025
Authors: Wenxuan Zhu, Bing Li, Cheng Zheng, Jinjie Mai, Jun Chen, Letian Jiang, Abdullah Hamdi, Sara Rojas Martinez, Chia-Wen Lin, Mohamed Elhoseiny, Bernard Ghanem
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated impressive 2D
image/video understanding capabilities. However, there is no publicly available
standardized benchmark to assess the ability of MLLMs to understand
4D objects (3D objects that evolve over time). In this paper, we
introduce 4D-Bench, the first benchmark to evaluate the capabilities of MLLMs
in 4D object understanding, featuring tasks in 4D object Question Answering (4D
object QA) and 4D object captioning. 4D-Bench provides 4D objects with diverse
categories, high-quality annotations, and tasks necessitating multi-view
spatial-temporal understanding, in contrast to existing 2D image/video-based
benchmarks. With 4D-Bench, we evaluate a wide range of open-source and
closed-source MLLMs. The results from the 4D object captioning experiment
indicate that MLLMs generally exhibit weaker temporal understanding than
appearance understanding; notably, while open-source models approach
closed-source performance in appearance understanding, they lag further
behind in temporal understanding. The 4D object QA experiments yield surprising
findings: even with simple single-object videos, MLLMs perform poorly, with the
state-of-the-art GPT-4o achieving only 63% accuracy against a human
baseline of 91%. These findings highlight a substantial gap in 4D object
understanding and the need for further advancements in MLLMs.
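To make the QA evaluation protocol concrete, the sketch below shows how a multiple-choice accuracy score like the 63% reported above could be computed. This is a minimal hypothetical Python sketch, not the paper's released code: the `QAItem` structure, the `query_mllm` callable, and the leading-letter matching rule are all illustrative assumptions.

```python
# Hypothetical sketch of a multiple-choice QA evaluation loop (illustration
# only; not the 4D-Bench release code). A 4D object is assumed to be rendered
# as multi-view video frames, and each QA item pairs those frames with a
# multiple-choice question. `query_mllm` is a stand-in for whichever MLLM API
# is under evaluation (e.g., GPT-4o).

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class QAItem:
    frame_paths: List[str]   # multi-view video frames sampled over time
    question: str
    choices: List[str]       # e.g., ["A. red", "B. blue", ...]
    answer: str              # ground-truth choice letter, e.g., "B"

def evaluate_qa(items: List[QAItem],
                query_mllm: Callable[[List[str], str], str]) -> float:
    """Return the model's accuracy over the QA items."""
    correct = 0
    for item in items:
        prompt = (item.question + "\n" + "\n".join(item.choices) +
                  "\nAnswer with the letter of the correct choice.")
        prediction = query_mllm(item.frame_paths, prompt).strip()
        # Compare only the leading choice letter, tolerating trailing text.
        if prediction[:1].upper() == item.answer[:1].upper():
            correct += 1
    return correct / len(items) if items else 0.0
```

Under this kind of protocol, a model's score is simply the fraction of items whose predicted choice letter matches the ground truth, which is how the GPT-4o (63%) and human (91%) figures above can be compared directly.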