
MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models

April 8, 2025
Authors: Pengfei Zhou, Fanrui Zhang, Xiaopeng Peng, Zhaopan Xu, Jiaxin Ai, Yansheng Qiu, Chuanhao Li, Zhen Li, Ming Li, Yukang Feng, Jianwen Sun, Haoquan Zhang, Zizhen Li, Xiaofeng Mao, Wangbo Zhao, Kai Wang, Xiaojun Chang, Wenqi Shao, Yang You, Kaipeng Zhang
cs.AI

Abstract

Multimodal reasoning, which integrates language and visual cues into problem solving and decision making, is a fundamental aspect of human intelligence and a crucial step toward artificial general intelligence. However, the evaluation of multimodal reasoning capabilities in Multimodal Large Language Models (MLLMs) remains inadequate. Most existing reasoning benchmarks are constrained by limited data size, narrow domain coverage, and unstructured knowledge distribution. To close these gaps, we introduce MDK12-Bench, a multi-disciplinary benchmark that assesses the reasoning capabilities of MLLMs via real-world K-12 examinations. Spanning six disciplines (math, physics, chemistry, biology, geography, and information science), the benchmark comprises 140K reasoning instances across diverse difficulty levels from primary school to 12th grade. It features 6,827 instance-level knowledge point annotations grounded in a well-organized knowledge structure, detailed answer explanations, difficulty labels, and cross-year partitions, providing a robust platform for comprehensive evaluation. Additionally, we present a novel dynamic evaluation framework that mitigates data contamination by bootstrapping question forms, question types, and image styles during evaluation. Extensive experiments on MDK12-Bench reveal significant limitations of current MLLMs in multimodal reasoning. The findings from our benchmark provide insights into the development of next-generation models. Our data and code are available at https://github.com/LanceZPF/MDK12.
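The dynamic evaluation framework described in the abstract perturbs each item before it reaches the model, so that a model that memorized the original exam item gains no advantage. Below is a minimal Python sketch of that idea, assuming a hypothetical instance schema and placeholder perturbation functions; none of these names (`Instance`, `perturb_form`, `evaluate`, or the field names) come from the released MDK12 code.

```python
import random
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Instance:
    # Hypothetical schema mirroring the fields the abstract describes.
    question: str
    image_path: str
    answer: str
    explanation: str                      # detailed answer explanation
    discipline: str                       # e.g. "physics"
    grade: int                            # 1 (primary school) through 12
    difficulty: str                       # e.g. "easy" / "hard"
    knowledge_points: list[str] = field(default_factory=list)
    year: int = 2024                      # enables cross-year partitions

def perturb_form(inst: Instance) -> Instance:
    # Placeholder: paraphrase the question stem (varies "question form").
    return inst

def perturb_type(inst: Instance) -> Instance:
    # Placeholder: e.g. convert multiple-choice to fill-in-the-blank
    # (varies "question type").
    return inst

def perturb_style(inst: Instance) -> Instance:
    # Placeholder: re-render the figure in a new visual style
    # (varies "image style").
    return inst

def evaluate(model: Callable[[str, str], str],
             data: list[Instance], seed: int = 0) -> float:
    """Accuracy under randomly bootstrapped instances."""
    rng = random.Random(seed)
    transforms = [perturb_form, perturb_type, perturb_style]
    correct = 0
    for inst in data:
        inst = rng.choice(transforms)(inst)   # one perturbation per item
        pred = model(inst.question, inst.image_path)
        correct += int(pred.strip().lower() == inst.answer.strip().lower())
    return correct / max(len(data), 1)
```

Because the perturbations are seeded, different models can be scored on the same bootstrapped variants, keeping comparisons fair while still defeating verbatim recall of the original exam items.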

