MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

Abstract

As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language Models (MLLMs) have attracted growing attention from both industry and academia. Built upon pre-trained LLMs, this family of models develops impressive multimodal perception and reasoning capabilities, such as writing code from a flow chart or creating a story based on an image. During development, evaluation is critical because it provides intuitive feedback and guidance for improving models. Unlike the traditional train-eval-test paradigm, which favors a single task such as image classification, the versatility of MLLMs has spurred a variety of new benchmarks and evaluation methods. In this paper, we present a comprehensive survey of MLLM evaluation, discussing four key aspects: 1) a summary of benchmark types organized by the capabilities they evaluate, including foundation capabilities, model self-analysis, and extended applications; 2) the typical process of benchmark construction, consisting of data collection, annotation, and precautions; 3) the systematic evaluation approach, composed of judge, metric, and toolkit; 4) the outlook for the next benchmark. This work aims to give researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods, thereby driving the progress of MLLM research.
