VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
November 22, 2024
Authors: Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, Si Liu
cs.AI
Abstract
The advancement of Large Vision Language Models (LVLMs) has significantly
improved multimodal understanding, yet challenges remain in video reasoning
tasks due to the scarcity of high-quality, large-scale datasets. Existing video
question-answering (VideoQA) datasets often rely on costly manual annotations
with insufficient granularity or automatic construction methods with redundant
frame-by-frame analysis, limiting their scalability and effectiveness for
complex reasoning. To address these challenges, we introduce VideoEspresso, a
novel dataset that features VideoQA pairs preserving essential spatial details
and temporal coherence, along with multimodal annotations of intermediate
reasoning steps. Our construction pipeline employs a semantic-aware method to
reduce redundancy, followed by generating QA pairs using GPT-4o. We further
develop video Chain-of-Thought (CoT) annotations to enrich reasoning processes,
guiding GPT-4o in extracting logical relationships from QA pairs and video
content. To exploit the potential of high-quality VideoQA pairs, we propose a
Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a
two-stage instruction fine-tuned reasoning LVLM. This framework adaptively
selects core frames and performs CoT reasoning using multimodal evidence.
Evaluated against 9 popular LVLMs on our proposed 14-task benchmark, our
method outperforms existing baselines on most tasks, demonstrating superior
video reasoning capabilities. Our code and dataset will be released at:
https://github.com/hshjerry/VideoEspresso
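The abstract does not spell out the semantic-aware redundancy-reduction step. A minimal sketch of one common approach, greedy deduplication over per-frame embeddings (e.g., from a CLIP-style image encoder), is shown below; the `threshold` value and the comparison against only the most recently kept frame are illustrative assumptions, not the authors' method.

```python
import numpy as np

def select_core_frames(frame_embeddings: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Keep a frame only if it is semantically distinct (cosine similarity
    below `threshold`) from the most recently kept frame."""
    # Normalize rows so dot products equal cosine similarities.
    normed = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    kept = [0]  # always keep the first frame as an anchor
    for i in range(1, len(normed)):
        if float(normed[i] @ normed[kept[-1]]) < threshold:
            kept.append(i)
    return kept
```

With embeddings of shape (num_frames, dim), this returns the indices of frames that survive deduplication; lowering the threshold keeps fewer, more diverse frames.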
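Similarly, the Hybrid LVLMs Collaboration framework is only described at a high level. The following schematic shows how a Frame Selector and a reasoning LVLM might compose at inference time; the `FrameSelector` and `ReasoningLVLM` interfaces and the prompt wording are hypothetical stand-ins, not the paper's actual APIs.

```python
from typing import Protocol, Sequence

class FrameSelector(Protocol):
    # Hypothetical interface: returns indices of question-relevant core frames.
    def select(self, frames: Sequence[bytes], question: str) -> list[int]: ...

class ReasoningLVLM(Protocol):
    # Hypothetical interface: generates text conditioned on images and a prompt.
    def generate(self, images: Sequence[bytes], prompt: str) -> str: ...

def answer_with_cot(frames: Sequence[bytes], question: str,
                    selector: FrameSelector, reasoner: ReasoningLVLM) -> str:
    # Stage 1: adaptively pick core frames as multimodal evidence.
    core = [frames[i] for i in selector.select(frames, question)]
    # Stage 2: prompt the reasoning LVLM for step-by-step (CoT) reasoning.
    prompt = (f"Question: {question}\n"
              "Reason step by step over the provided key frames, then answer.")
    return reasoner.generate(images=core, prompt=prompt)
```

The design point this illustrates is the division of labor: a lightweight selector filters the video down to core frames, so the heavier reasoning LVLM only attends to compact multimodal evidence when producing its chain of thought.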