

UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation

March 19, 2025
Authors: Qihui Zhang, Munan Ning, Zheyuan Liu, Yanbo Wang, Jiayi Ye, Yue Huang, Shuo Yang, Xiao Chen, Yibing Song, Li Yuan
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) have emerged to tackle the challenges of Visual Question Answering (VQA), sparking a new research focus on conducting objective evaluations of these models. Existing evaluation methods face limitations due to the significant human workload required to design Q&A pairs for visual images, which inherently restricts the scale and scope of evaluations. Although automated MLLM-as-judge approaches attempt to reduce the human workload through automatic evaluations, they often introduce biases. To address these problems, we propose the Unsupervised Peer review MLLM Evaluation (UPME) framework. It utilizes only image data, allowing models to automatically generate questions and conduct peer review assessments of answers from other models, effectively alleviating reliance on human workload. Additionally, we introduce a vision-language scoring system to mitigate bias, which focuses on three aspects: (i) response correctness; (ii) visual understanding and reasoning; and (iii) image-text correlation. Experimental results demonstrate that UPME achieves a Pearson correlation of 0.944 with human evaluations on the MMstar dataset and 0.814 on the ScienceQA dataset, indicating that our framework closely aligns with human-designed benchmarks and inherent human preferences.
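
The peer-review loop the abstract describes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the model interface (`generate_question`, `answer`, `score_answer`) and the equal weighting of the three scoring aspects are hypothetical choices made here for demonstration.

```python
from itertools import permutations

# Sketch of an unsupervised peer-review evaluation loop (assumed interface,
# not the paper's code). Each candidate MLLM is assumed to expose:
#   generate_question(image) -> str
#   answer(image, question) -> str
#   score_answer(image, question, answer, aspect) -> float in [0, 1]

# The three aspects of the vision-language scoring system from the abstract.
ASPECTS = ("correctness", "visual_reasoning", "image_text_correlation")

def peer_review_scores(models, images):
    """Score each model by letting every other model act as its reviewer,
    using only raw images (no human-written Q&A pairs)."""
    totals = {name: 0.0 for name in models}
    counts = {name: 0 for name in models}
    for image in images:
        # Every ordered (reviewer, candidate) pair of distinct models.
        for reviewer_name, candidate_name in permutations(models, 2):
            reviewer = models[reviewer_name]
            candidate = models[candidate_name]
            question = reviewer.generate_question(image)
            answer = candidate.answer(image, question)
            # Vision-language score: mean over the three aspects
            # (equal weighting is an assumption of this sketch).
            score = sum(
                reviewer.score_answer(image, question, answer, aspect)
                for aspect in ASPECTS
            ) / len(ASPECTS)
            totals[candidate_name] += score
            counts[candidate_name] += 1
    return {name: totals[name] / max(counts[name], 1) for name in totals}
```

Rankings derived from such averaged peer scores are what the paper then compares against human judgments, reporting Pearson correlations of 0.944 (MMstar) and 0.814 (ScienceQA).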
