
ProBench: Judging Multimodal Foundation Models on Open-ended Multi-domain Expert Tasks

March 10, 2025
作者: Yan Yang, Dongxu Li, Haoning Wu, Bei Chen, Liu Liu, Liyuan Pan, Junnan Li
cs.AI

Abstract

Solving expert-level multimodal tasks is a key milestone towards general intelligence. As the capabilities of multimodal large language models (MLLMs) continue to improve, evaluation of such advanced multimodal intelligence becomes necessary yet challenging. In this work, we introduce ProBench, a benchmark of open-ended user queries that require professional expertise and advanced reasoning. ProBench consists of 4,000 high-quality samples independently submitted by professionals based on their daily productivity demands. It spans across 10 fields and 56 sub-fields, including science, arts, humanities, coding, mathematics, and creative writing. Experimentally, we evaluate and compare 24 latest models using MLLM-as-a-Judge. Our results reveal that although the best open-source models rival the proprietary ones, ProBench presents significant challenges in visual perception, textual understanding, domain knowledge and advanced reasoning, thus providing valuable directions for future multimodal AI research efforts.
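The MLLM-as-a-Judge protocol mentioned above typically has a strong multimodal model compare two candidate responses per query and emit a verdict; those pairwise verdicts are then aggregated into per-model scores. A minimal sketch of such an aggregation step (the data layout and function name here are illustrative assumptions, not ProBench's actual implementation):

```python
from collections import defaultdict

def win_rates(verdicts):
    """Aggregate pairwise judge verdicts into per-model win rates.

    verdicts: iterable of (model_a, model_b, winner) tuples, where
    winner is model_a, model_b, or "tie". A tie counts as half a win
    for each side.
    """
    wins = defaultdict(float)
    games = defaultdict(int)
    for a, b, winner in verdicts:
        games[a] += 1
        games[b] += 1
        if winner == "tie":
            wins[a] += 0.5
            wins[b] += 0.5
        else:
            wins[winner] += 1.0
    # Win rate = (wins + half-credit for ties) / comparisons played.
    return {m: wins[m] / games[m] for m in games}

# Example: model_x wins one comparison and ties another.
rates = win_rates([
    ("model_x", "model_y", "model_x"),
    ("model_x", "model_y", "tie"),
])
```

In practice, leaderboard systems often convert such pairwise outcomes into Elo-style ratings rather than raw win rates, but the win-rate view is the simplest faithful summary of the judge's verdicts.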

