

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

November 22, 2024
作者: Chaoyou Fu, Yi-Fan Zhang, Shukang Yin, Bo Li, Xinyu Fang, Sirui Zhao, Haodong Duan, Xing Sun, Ziwei Liu, Liang Wang, Caifeng Shan, Ran He
cs.AI

Abstract

As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language Models (MLLMs) have garnered increasing attention from both industry and academia. Building upon pre-trained LLMs, this family of models further develops impressive multimodal perception and reasoning capabilities, such as writing code given a flowchart or creating stories based on an image. In the development process, evaluation is critical since it provides intuitive feedback and guidance for improving models. Distinct from the traditional train-eval-test paradigm, which favors only a single task such as image classification, the versatility of MLLMs has spurred the rise of various new benchmarks and evaluation methods. In this paper, we aim to present a comprehensive survey of MLLM evaluation, discussing four key aspects: 1) the summarized benchmark types, divided by the capabilities they evaluate, including foundation capabilities, model self-analysis, and extended applications; 2) the typical process of benchmark construction, consisting of data collection, annotation, and precautions; 3) the systematic evaluation manner composed of judges, metrics, and toolkits; 4) the outlook for the next generation of benchmarks. This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods, thereby driving the progress of MLLM research.
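
To make the abstract's "judge, metric, and toolkit" trio concrete, here is a minimal sketch (hypothetical Python, not code from the paper) of how an evaluation toolkit might combine a rule-based metric for multiple-choice questions with an LLM-as-judge hook for free-form answers. All names, including `call_judge_model`, are illustrative placeholders.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Sample:
    question: str
    prediction: str  # the MLLM's raw output
    answer: str      # gold label, e.g. "B"

def score_choice(pred: str, gold: str) -> float:
    """Rule-based metric: extract the first standalone option letter (A-D)."""
    m = re.search(r"\b([A-D])\b", pred.upper())
    return 1.0 if m and m.group(1) == gold.upper() else 0.0

def call_judge_model(prompt: str) -> str:
    """Placeholder for a real judge-model API (e.g., a chat-completion call)."""
    raise NotImplementedError("wire this to an actual judge model")

def llm_judge(pred: str, gold: str) -> float:
    """LLM-as-judge metric for free-form answers: ask a judge for a 0/1 verdict."""
    prompt = (f"Reference answer: {gold}\nModel answer: {pred}\n"
              "Reply 1 if the model answer matches the reference, else 0.")
    return float(call_judge_model(prompt).strip() == "1")

def accuracy(samples: List[Sample], scorer: Callable[[str, str], float]) -> float:
    """Toolkit-style aggregation: average per-sample scores into one number."""
    return sum(scorer(s.prediction, s.answer) for s in samples) / len(samples)

if __name__ == "__main__":
    data = [
        Sample("Which shape is red?", "The answer is B.", "B"),
        Sample("How many cats are there?", "C", "A"),
    ]
    print(f"multiple-choice accuracy = {accuracy(data, score_choice):.2f}")  # 0.50
```

The split mirrors the survey's framing: deterministic metrics suffice for constrained formats like multiple choice, while open-ended responses typically need a judge model, and a toolkit's job is to route each sample to the right scorer and aggregate the results.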
