

Redundancy Principles for MLLMs Benchmarks

January 20, 2025
Authors: Zicheng Zhang, Xiangyu Zhao, Xinyu Fang, Chunyi Li, Xiaohong Liu, Xiongkuo Min, Haodong Duan, Kai Chen, Guangtao Zhai
cs.AI

Abstract

With the rapid iteration of Multi-modality Large Language Models (MLLMs) and the evolving demands of the field, the number of benchmarks produced annually has surged into the hundreds. This rapid growth has inevitably led to significant redundancy among benchmarks. It is therefore crucial to take a step back, critically assess the current state of redundancy, and propose targeted principles for constructing effective MLLM benchmarks. In this paper, we focus on redundancy from three key perspectives: 1) redundancy of benchmark capability dimensions, 2) redundancy in the number of test questions, and 3) cross-benchmark redundancy within specific domains. Through a comprehensive analysis of hundreds of MLLMs' performance across more than 20 benchmarks, we aim to quantitatively measure the level of redundancy in existing MLLM evaluations, provide valuable insights to guide the future development of MLLM benchmarks, and offer strategies to refine and address redundancy issues effectively.
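One way to make the cross-benchmark redundancy idea concrete: if two benchmarks rank the same set of models almost identically, they carry largely overlapping information. The sketch below illustrates this with Spearman rank correlation between the score vectors of two benchmarks. This is a minimal illustration under an assumed metric; the abstract does not specify the paper's actual redundancy measure, and all benchmark names and scores here are made up.

```python
# Hypothetical sketch: quantify cross-benchmark redundancy as the
# Spearman rank correlation between two benchmarks' model-score vectors.
# (Assumed metric for illustration; not the paper's stated method.)

def rankdata(scores):
    """Assign 1-based average ranks to scores (ties share the mean rank)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over any group of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman rank correlation between two equal-length score lists."""
    ra, rb = rankdata(a), rankdata(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

# Toy data: five models' scores on two hypothetical benchmarks.
bench_a = [61.2, 55.4, 70.1, 48.9, 66.0]
bench_b = [58.0, 62.3, 69.5, 50.1, 64.2]
print(f"redundancy (Spearman) = {spearman(bench_a, bench_b):.3f}")  # 0.900
```

A correlation near 1.0 would suggest the second benchmark adds little discriminative information beyond the first, which is the intuition behind pruning redundant benchmarks within a domain.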

