VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning

April 10, 2025
Authors: Yukun Qi, Yiming Zhao, Yu Zeng, Xikun Bao, Wenxuan Huang, Lin Chen, Zehui Chen, Jie Zhao, Zhongang Qi, Feng Zhao
cs.AI

Abstract

The advancement of Chain-of-Thought (CoT) reasoning has significantly enhanced the capabilities of large language models (LLMs) and large vision-language models (LVLMs). However, a rigorous evaluation framework for video CoT reasoning remains absent. Current video benchmarks fail to adequately assess the reasoning process or to reveal whether failures stem from deficiencies in perception or in reasoning. Therefore, we introduce VCR-Bench, a novel benchmark designed to comprehensively evaluate LVLMs' Video Chain-of-Thought Reasoning capabilities. VCR-Bench comprises 859 videos spanning a variety of content and durations, along with 1,034 high-quality question-answer pairs. Each pair is manually annotated with a stepwise CoT rationale, in which every step is tagged to indicate whether it draws on perception or reasoning capabilities. Furthermore, we design seven distinct task dimensions and propose the CoT score, which assesses the entire CoT process based on the stepwise tagged rationales. Extensive experiments on VCR-Bench highlight substantial limitations in current LVLMs. Even the top-performing model, o1, achieves only a 62.8% CoT score and 56.7% accuracy, while most models score below 40%. The experiments also show that most models score lower on perception steps than on reasoning steps, revealing LVLMs' key bottleneck in temporal-spatial information processing for complex video reasoning. A robust positive correlation between the CoT score and accuracy confirms the validity of our evaluation framework and underscores the critical role of CoT reasoning in solving complex video reasoning tasks. We hope VCR-Bench can serve as a standardized evaluation framework and expose the actual shortcomings in complex video reasoning tasks.
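
The abstract does not specify how the CoT score is computed, so the sketch below is only one plausible reading: each manually annotated rationale step (tagged as perception or reasoning) is checked against the model's generated chain of thought, and per-tag coverage plus an overall score are reported. The `ReferenceStep` type, the `judge_step` matcher, and the equal-weight aggregation are illustrative assumptions, not the paper's protocol.

```python
# Illustrative sketch of a step-tagged CoT scoring scheme.
# NOT the paper's exact protocol, which the abstract does not describe.

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class ReferenceStep:
    text: str  # manually annotated rationale step
    tag: str   # "perception" or "reasoning"


def cot_scores(
    model_cot: str,
    reference_steps: Iterable[ReferenceStep],
    judge_step: Callable[[str, str], bool],
) -> dict:
    """Return per-tag coverage of reference steps plus an overall CoT score.

    judge_step(model_cot, step_text) should return True when the model's
    chain of thought correctly covers the annotated step (e.g. an LLM-based
    or rule-based judge).
    """
    hits = {"perception": 0, "reasoning": 0}
    totals = {"perception": 0, "reasoning": 0}
    for step in reference_steps:
        totals[step.tag] += 1
        if judge_step(model_cot, step.text):
            hits[step.tag] += 1

    scores = {
        tag: (hits[tag] / totals[tag] if totals[tag] else 0.0)
        for tag in totals
    }
    total = sum(totals.values())
    scores["cot_score"] = sum(hits.values()) / total if total else 0.0
    return scores


if __name__ == "__main__":
    # Toy usage with a trivial keyword judge, purely for illustration.
    refs = [
        ReferenceStep("The person picks up a red cup at 0:12.", "perception"),
        ReferenceStep("Picking up the cup implies they are about to drink.", "reasoning"),
    ]
    model_cot = (
        "A red cup is picked up around 12 seconds in, "
        "so the person is likely about to drink."
    )
    naive_judge = lambda cot, step: any(
        word in cot.lower() for word in step.lower().split()[:3]
    )
    print(cot_scores(model_cot, refs, naive_judge))
```

Reporting perception and reasoning coverage separately is what lets an analysis like the one in the abstract attribute failures to perception rather than reasoning; in practice the keyword judge would be replaced by a far stronger matcher.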
