
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs

April 21, 2025
作者: David Ma, Yuanxing Zhang, Jincheng Ren, Jarvis Guo, Yifan Yao, Zhenlin Wei, Zhenzhu Yang, Zhongyuan Peng, Boyu Feng, Jun Ma, Xiao Gu, Zhoufutu Wen, King Zhu, Yancheng He, Meng Cao, Shiwen Ni, Jiaheng Liu, Wenhao Huang, Ge Zhang, Xiaojie Jin
cs.AI

Abstract

Existing evaluation frameworks for Multimodal Large Language Models (MLLMs) primarily focus on image reasoning or general video understanding tasks, largely overlooking the significant role of image context in video comprehension. To bridge this gap, we propose IV-Bench, the first comprehensive benchmark for evaluating Image-Grounded Video Perception and Reasoning. IV-Bench consists of 967 videos paired with 2,585 meticulously annotated image-text queries across 13 tasks (7 perception and 6 reasoning tasks) and 5 representative categories. Extensive evaluations of state-of-the-art open-source (e.g., InternVL2.5, Qwen2.5-VL) and closed-source (e.g., GPT-4o, Gemini2-Flash, and Gemini2-Pro) MLLMs demonstrate that current models substantially underperform in image-grounded video perception and reasoning, achieving at most 28.9% accuracy. Further analysis reveals key factors influencing model performance on IV-Bench, including inference pattern, frame number, and resolution. Additionally, through a simple data synthesis approach, we show that the challenges of IV-Bench extend beyond merely aligning the data format during training. These findings collectively provide valuable insights for future research. Our code and data are released at https://github.com/multimodal-art-projection/IV-Bench.
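
The exact data schema is defined in the released repository; as a rough illustration of what an image-grounded video query might look like and how per-task accuracy could be aggregated, here is a minimal Python sketch. The field names (video_id, image_path, question, answer, task) and the load_queries/evaluate helpers are hypothetical, not taken from the IV-Bench release.

```python
# Hypothetical sketch of an IV-Bench-style evaluation loop.
# Field names are illustrative; the actual schema is defined in the
# IV-Bench repository (github.com/multimodal-art-projection/IV-Bench).
import json
from collections import defaultdict
from typing import Callable

def load_queries(path: str) -> list[dict]:
    """Load image-text queries, each tied to a source video and a task label."""
    with open(path, "r", encoding="utf-8") as f:
        # e.g. [{"video_id": ..., "image_path": ..., "question": ...,
        #        "answer": ..., "task": ...}, ...]
        return json.load(f)

def evaluate(queries: list[dict], predict: Callable[[dict], str]) -> dict[str, float]:
    """Compute per-task accuracy for a model callable predict(query) -> answer string."""
    correct, total = defaultdict(int), defaultdict(int)
    for q in queries:
        total[q["task"]] += 1
        if predict(q) == q["answer"]:
            correct[q["task"]] += 1
    return {task: correct[task] / total[task] for task in total}

# Usage (with a user-supplied model wrapper):
# accuracies = evaluate(load_queries("iv_bench_queries.json"), my_mllm_predict)
```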
