
VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning

April 10, 2025
Authors: Yukun Qi, Yiming Zhao, Yu Zeng, Xikun Bao, Wenxuan Huang, Lin Chen, Zehui Chen, Jie Zhao, Zhongang Qi, Feng Zhao
cs.AI

Abstract

The advancement of Chain-of-Thought (CoT) reasoning has significantly enhanced the capabilities of large language models (LLMs) and large vision-language models (LVLMs). However, a rigorous evaluation framework for video CoT reasoning remains absent. Current video benchmarks fail to adequately assess the reasoning process or to reveal whether failures stem from deficiencies in perception or in reasoning. We therefore introduce VCR-Bench, a novel benchmark designed to comprehensively evaluate LVLMs' Video Chain-of-Thought Reasoning capabilities. VCR-Bench comprises 859 videos spanning a variety of content types and durations, along with 1,034 high-quality question-answer pairs. Each pair is manually annotated with a stepwise CoT rationale, where every step is tagged to indicate whether it exercises perception or reasoning capabilities. Furthermore, we design seven distinct task dimensions and propose the CoT score to assess the entire CoT process based on the stepwise tagged rationales. Extensive experiments on VCR-Bench highlight substantial limitations in current LVLMs. Even the top-performing model, o1, achieves only a 62.8% CoT score and a 56.7% accuracy, while most models score below 40%. The experiments also show that most models score lower on perception steps than on reasoning steps, revealing a key bottleneck of LVLMs in temporal-spatial information processing for complex video reasoning. A robust positive correlation between the CoT score and accuracy confirms the validity of our evaluation framework and underscores the critical role of CoT reasoning in solving complex video reasoning tasks. We hope VCR-Bench will serve as a standardized evaluation framework and expose the actual shortcomings in complex video reasoning tasks.
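The abstract describes the CoT score only at a high level: each annotated rationale step is tagged as a perception or reasoning step, and the model's generated rationale is judged against these reference steps. As a minimal illustrative sketch (not the paper's actual formula), the code below shows one way such step-tagged scoring could be aggregated; the RefStep structure, the per-step coverage judgments, and the recall-style aggregation are all assumptions introduced here.

```python
# Hypothetical sketch of a step-tagged CoT scoring scheme.
# Assumption: a judge has already marked, for each annotated reference step,
# whether the model's rationale covers it. This is NOT the paper's formula.
from dataclasses import dataclass


@dataclass
class RefStep:
    text: str       # annotated reference reasoning step
    tag: str        # "perception" or "reasoning"
    covered: bool   # judged: does the model's rationale entail this step?


def cot_scores(steps: list[RefStep]) -> dict[str, float]:
    """Return an overall recall-style CoT score plus a per-capability breakdown."""
    def recall(subset: list[RefStep]) -> float:
        return sum(s.covered for s in subset) / len(subset) if subset else 0.0

    return {
        "cot_score": recall(steps),
        "perception": recall([s for s in steps if s.tag == "perception"]),
        "reasoning": recall([s for s in steps if s.tag == "reasoning"]),
    }


if __name__ == "__main__":
    annotated = [
        RefStep("Identify the person picking up the red cup at 0:12", "perception", True),
        RefStep("Note that the cup is later placed on the left shelf", "perception", False),
        RefStep("Infer the cup's final location from the two observations", "reasoning", True),
    ]
    print(cot_scores(annotated))
```

A breakdown like this makes it possible to report perception and reasoning sub-scores separately, which is the kind of diagnosis the benchmark's findings (lower perception scores than reasoning scores) rely on.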
