MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models
April 8, 2025
Authors: Pengfei Zhou, Fanrui Zhang, Xiaopeng Peng, Zhaopan Xu, Jiaxin Ai, Yansheng Qiu, Chuanhao Li, Zhen Li, Ming Li, Yukang Feng, Jianwen Sun, Haoquan Zhang, Zizhen Li, Xiaofeng Mao, Wangbo Zhao, Kai Wang, Xiaojun Chang, Wenqi Shao, Yang You, Kaipeng Zhang
cs.AI
Abstract
Multimodal reasoning, which integrates language and visual cues into problem
solving and decision making, is a fundamental aspect of human intelligence and
a crucial step toward artificial general intelligence. However, the evaluation
of multimodal reasoning capabilities in Multimodal Large Language Models
(MLLMs) remains inadequate. Most existing reasoning benchmarks are constrained
by limited data size, narrow domain coverage, and unstructured knowledge
distribution. To close these gaps, we introduce MDK12-Bench, a
multi-disciplinary benchmark assessing the reasoning capabilities of MLLMs via
real-world K-12 examinations. Spanning six disciplines (math, physics,
chemistry, biology, geography, and information science), our benchmark
comprises 140K reasoning instances across diverse difficulty levels from
primary school to 12th grade. It features 6,827 instance-level knowledge point
annotations based on a well-organized knowledge structure, detailed answer
explanations, difficulty labels, and cross-year partitions, providing a robust
platform for comprehensive evaluation. Additionally, we present a novel dynamic
evaluation framework to mitigate data contamination issues by bootstrapping
question forms, question types, and image styles during evaluation. Extensive
experiments on MDK12-Bench reveal significant limitations of current MLLMs
in multimodal reasoning. The findings from our benchmark provide insights into
the development of next-generation models. Our data and code are available
at https://github.com/LanceZPF/MDK12.