
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs

April 21, 2025
作者: David Ma, Yuanxing Zhang, Jincheng Ren, Jarvis Guo, Yifan Yao, Zhenlin Wei, Zhenzhu Yang, Zhongyuan Peng, Boyu Feng, Jun Ma, Xiao Gu, Zhoufutu Wen, King Zhu, Yancheng He, Meng Cao, Shiwen Ni, Jiaheng Liu, Wenhao Huang, Ge Zhang, Xiaojie Jin
cs.AI

Abstract

Existing evaluation frameworks for Multimodal Large Language Models (MLLMs) primarily focus on image reasoning or general video understanding tasks, largely overlooking the significant role of image context in video comprehension. To bridge this gap, we propose IV-Bench, the first comprehensive benchmark for evaluating image-grounded video perception and reasoning. IV-Bench consists of 967 videos paired with 2,585 meticulously annotated image-text queries across 13 tasks (7 perception and 6 reasoning tasks) and 5 representative categories. Extensive evaluations of state-of-the-art open-source (e.g., InternVL2.5, Qwen2.5-VL) and closed-source (e.g., GPT-4o, Gemini2-Flash, and Gemini2-Pro) MLLMs demonstrate that current models substantially underperform in image-grounded video perception and reasoning, achieving at most 28.9% accuracy. Further analysis reveals key factors influencing model performance on IV-Bench, including inference pattern, frame number, and resolution. Additionally, through a simple data synthesis approach, we demonstrate that the challenges of IV-Bench extend beyond merely aligning the data format during training. These findings collectively provide valuable insights for future research. Our code and data are released at https://github.com/multimodal-art-projection/IV-Bench.
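The abstract describes evaluating MLLMs on image-text queries paired with videos and reporting accuracy (overall, and implicitly per task). The sketch below shows what such an evaluation loop could look like; the item schema, field names, and `mock_model` stand-in are illustrative assumptions, not the released IV-Bench format or any model's actual API.

```python
from collections import defaultdict

# Hypothetical, minimal representation of IV-Bench-style items:
# each query pairs a video with a grounding image and a text question,
# carries a gold answer, and is tagged with one of the benchmark tasks.
items = [
    {"video": "v001.mp4", "image": "img001.png",
     "question": "Which object shown in the image appears first?",
     "answer": "A", "task": "perception/object_grounding"},
    {"video": "v002.mp4", "image": "img002.png",
     "question": "Why does the person shown in the image leave the room?",
     "answer": "C", "task": "reasoning/causal"},
]

def mock_model(video, image, question):
    """Stand-in for an MLLM call; always answers 'A'."""
    return "A"

def evaluate(items, model):
    """Compute overall and per-task accuracy over benchmark items."""
    correct = 0
    per_task = defaultdict(lambda: [0, 0])  # task -> [hits, total]
    for it in items:
        pred = model(it["video"], it["image"], it["question"])
        hit = int(pred == it["answer"])
        correct += hit
        per_task[it["task"]][0] += hit
        per_task[it["task"]][1] += 1
    overall = correct / len(items)
    return overall, {t: hits / n for t, (hits, n) in per_task.items()}

overall, by_task = evaluate(items, mock_model)
print(f"overall accuracy: {overall:.1%}")
```

A real harness would replace `mock_model` with calls to a specific MLLM (sampling frames from the video at a chosen frame number and resolution, two of the factors the paper identifies as influential), but the aggregation logic stays the same.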

