MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models
April 4, 2025
Authors: Wulin Xie, Yi-Fan Zhang, Chaoyou Fu, Yang Shi, Bingyan Nie, Hongkai Chen, Zhang Zhang, Liang Wang, Tieniu Tan
cs.AI
Abstract
Existing MLLM benchmarks face significant challenges in evaluating Unified
MLLMs (U-MLLMs) due to: 1) lack of standardized benchmarks for traditional
tasks, leading to inconsistent comparisons; 2) absence of benchmarks for
mixed-modality generation, which fails to assess multimodal reasoning
capabilities. We present a comprehensive evaluation framework designed to
systematically assess U-MLLMs. Our benchmark includes: 1. Standardized Traditional
Task Evaluation. We sample from 12 datasets, covering 10 tasks with 30
subtasks, ensuring consistent and fair comparisons across studies. 2. Unified
Task Assessment. We introduce five novel tasks testing multimodal reasoning,
including image editing, commonsense QA with image generation, and geometric
reasoning. 3. Comprehensive Model Benchmarking. We evaluate 12 leading U-MLLMs,
such as Janus-Pro, EMU3, VILA-U, and Gemini2-flash, alongside specialized
understanding (e.g., Claude-3.5-Sonnet) and generation models (e.g., DALL-E-3).
Our findings reveal substantial performance gaps in existing U-MLLMs,
highlighting the need for more robust models capable of handling mixed-modality
tasks effectively. The code and evaluation data can be found in
https://mme-unify.github.io/.
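To make the task/subtask structure of the benchmark concrete, below is a minimal sketch of how an evaluation harness over such a hierarchy could aggregate scores. The `score_sample` helper, the JSON field names, and the macro-averaging order (subtask, then task, then overall) are illustrative assumptions, not the released MME-Unify evaluation code; metrics for generation and editing tasks in particular would differ from exact-match accuracy.

```python
from collections import defaultdict
import json


def score_sample(prediction: str, answer: str) -> float:
    # Illustrative scorer: exact-match accuracy, e.g. for multiple-choice answers.
    # The benchmark's actual metrics (e.g., for image generation/editing) differ.
    return float(prediction.strip().lower() == answer.strip().lower())


def evaluate(results_path: str) -> dict:
    """Aggregate per-sample scores into per-subtask, per-task, and overall accuracy.

    Each line of `results_path` is assumed to be a JSON record with
    "task", "subtask", "prediction", and "answer" fields (hypothetical schema).
    """
    per_subtask = defaultdict(list)
    with open(results_path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            key = (rec["task"], rec["subtask"])
            per_subtask[key].append(score_sample(rec["prediction"], rec["answer"]))

    # Average within each subtask, then average subtasks within each task,
    # so subtasks with more samples do not dominate a task's score.
    subtask_acc = {k: sum(v) / len(v) for k, v in per_subtask.items()}
    per_task = defaultdict(list)
    for (task, _subtask), acc in subtask_acc.items():
        per_task[task].append(acc)
    task_acc = {t: sum(v) / len(v) for t, v in per_task.items()}
    overall = sum(task_acc.values()) / len(task_acc)
    return {"subtask": subtask_acc, "task": task_acc, "overall": overall}
```

The nested macro-averaging shown here is one reasonable design choice for a benchmark with uneven subtask sizes; the paper may weight tasks or subtasks differently.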