

MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models

February 2, 2025
Authors: Huanqia Cai, Yijun Yang, Winston Hu
cs.AI

Abstract

IQ testing has served as a foundational methodology for evaluating human cognitive capabilities, deliberately decoupling assessment from linguistic background, language proficiency, or domain-specific knowledge to isolate core competencies in abstraction and reasoning. Yet, artificial intelligence research currently lacks systematic benchmarks to quantify these critical cognitive dimensions in multimodal systems. To address this critical gap, we propose MM-IQ, a comprehensive evaluation framework comprising 2,710 meticulously curated test items spanning 8 distinct reasoning paradigms. Through systematic evaluation of leading open-source and proprietary multimodal models, our benchmark reveals striking limitations: even state-of-the-art architectures achieve only marginally superior performance to random chance (27.49% vs. 25% baseline accuracy). This substantial performance chasm highlights the inadequacy of current multimodal systems in approximating fundamental human reasoning capacities, underscoring the need for paradigm-shifting advancements to bridge this cognitive divide.
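To make the headline comparison concrete, the sketch below shows how accuracy on a multiple-choice benchmark is measured against a random-chance baseline. It is not code from the paper: the item structure, the 4-option format, and the `random_guesser` stand-in are assumptions for illustration only; a real evaluation would replace the guesser with calls to a multimodal model's inference API.

```python
import random

# Hypothetical MM-IQ-style items: each has 4 answer options and a ground-truth label.
# The real benchmark spans 2,710 items across 8 reasoning paradigms; here the answers
# are synthetic so the script is self-contained.
items = [
    {"id": i, "options": ["A", "B", "C", "D"], "answer": random.choice("ABCD")}
    for i in range(2710)
]

def random_guesser(item):
    """Stand-in for a multimodal model: picks one of the four options uniformly at random."""
    return random.choice(item["options"])

def accuracy(predict, items):
    """Fraction of items where the predicted option matches the ground-truth answer."""
    correct = sum(predict(item) == item["answer"] for item in items)
    return correct / len(items)

chance = 1 / 4  # 25% baseline for 4-option multiple choice
acc = accuracy(random_guesser, items)
print(f"accuracy: {acc:.2%} vs. chance baseline: {chance:.2%}")
```

Under this framing, the paper's best reported system reaches 27.49%, only marginally above the 25% that a uniform random guesser attains in expectation on 4-option items.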

