

Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

January 23, 2025
作者: Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, Ziwei Liu
cs.AI

Abstract

Humans acquire knowledge through three cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems. Videos serve as an effective medium for this learning process, facilitating a progression through these cognitive stages. However, existing video benchmarks fail to systematically evaluate the knowledge acquisition capabilities of Large Multimodal Models (LMMs). To address this gap, we introduce Video-MMMU, a multi-modal, multi-disciplinary benchmark designed to assess LMMs' ability to acquire and utilize knowledge from videos. Video-MMMU features a curated collection of 300 expert-level videos and 900 human-annotated questions across six disciplines, evaluating knowledge acquisition through stage-aligned question-answer pairs: Perception, Comprehension, and Adaptation. A proposed knowledge gain metric, Δknowledge, quantifies improvement in performance after video viewing. Evaluation of LMMs reveals a steep decline in performance as cognitive demands increase and highlights a significant gap between human and model knowledge acquisition, underscoring the need for methods to enhance LMMs' capability to learn and adapt from videos.
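The abstract says only that Δknowledge "quantifies improvement in performance after video viewing." As a minimal sketch, one common way to express such a metric is a normalized gain: the raw accuracy improvement divided by the available headroom. The function name and the normalization choice below are illustrative assumptions, not the paper's exact definition.

```python
def knowledge_gain(acc_before: float, acc_after: float) -> float:
    """Normalized knowledge gain between two accuracies in [0, 1].

    Assumption: a headroom-normalized gain, (after - before) / (1 - before),
    is used here for illustration; the paper defines Δknowledge precisely.
    """
    if not (0.0 <= acc_before < 1.0 and 0.0 <= acc_after <= 1.0):
        raise ValueError("accuracies must lie in [0, 1], with acc_before < 1")
    return (acc_after - acc_before) / (1.0 - acc_before)

# Example: a model at 40% accuracy before watching the video, 55% after.
gain = knowledge_gain(0.40, 0.55)  # 0.15 / 0.60 = 0.25 (25% of headroom closed)
```

Normalizing by headroom makes gains comparable across models: moving from 40% to 55% reflects more learning than the same 15-point jump from 0% would suggest relative to what remained achievable.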

