VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
November 22, 2024
Authors: Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, Si Liu
cs.AI
Abstract
The advancement of Large Vision Language Models (LVLMs) has significantly
improved multimodal understanding, yet challenges remain in video reasoning
tasks due to the scarcity of high-quality, large-scale datasets. Existing video
question-answering (VideoQA) datasets often rely on costly manual annotations
with insufficient granularity or automatic construction methods with redundant
frame-by-frame analysis, limiting their scalability and effectiveness for
complex reasoning. To address these challenges, we introduce VideoEspresso, a
novel dataset that features VideoQA pairs preserving essential spatial details
and temporal coherence, along with multimodal annotations of intermediate
reasoning steps. Our construction pipeline employs a semantic-aware method to
reduce redundancy, followed by generating QA pairs using GPT-4o. We further
develop video Chain-of-Thought (CoT) annotations to enrich reasoning processes,
guiding GPT-4o in extracting logical relationships from QA pairs and video
content. To exploit the potential of high-quality VideoQA pairs, we propose a
Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a
two-stage instruction fine-tuned reasoning LVLM. This framework adaptively
selects core frames and performs CoT reasoning using multimodal evidence.
Evaluated against 9 popular LVLMs on our proposed 14-task benchmark, our
method outperforms existing baselines on most tasks, demonstrating superior
video reasoning capabilities. Our code and dataset will be released at:
https://github.com/hshjerry/VideoEspresso
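The abstract does not spell out the semantic-aware redundancy-reduction step. A minimal sketch of one common approach, greedy deduplication over per-frame embeddings (e.g., from a CLIP-style image encoder), is shown below; the `threshold` value and the comparison against only the most recently kept frame are illustrative assumptions, not the authors' method.

```python
import numpy as np

def select_core_frames(frame_embeddings: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Keep a frame only if it is semantically distinct (cosine similarity
    below `threshold`) from the most recently kept frame."""
    # Normalize rows so dot products equal cosine similarities.
    normed = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    kept = [0]  # always keep the first frame as an anchor
    for i in range(1, len(normed)):
        if float(normed[i] @ normed[kept[-1]]) < threshold:
            kept.append(i)
    return kept
```

With embeddings of shape (num_frames, dim), this returns the indices of frames that survive deduplication; lowering the threshold keeps fewer, more diverse frames.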
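Similarly, the Hybrid LVLMs Collaboration framework is only described at a high level. The following schematic shows how a Frame Selector and a reasoning LVLM might compose at inference time; the `FrameSelector` and `ReasoningLVLM` interfaces and the prompt wording are hypothetical stand-ins, not the paper's actual APIs.

```python
from typing import Protocol, Sequence

class FrameSelector(Protocol):
    # Hypothetical interface: returns indices of question-relevant core frames.
    def select(self, frames: Sequence[bytes], question: str) -> list[int]: ...

class ReasoningLVLM(Protocol):
    # Hypothetical interface: generates text conditioned on images and a prompt.
    def generate(self, images: Sequence[bytes], prompt: str) -> str: ...

def answer_with_cot(frames: Sequence[bytes], question: str,
                    selector: FrameSelector, reasoner: ReasoningLVLM) -> str:
    # Stage 1: adaptively pick core frames as multimodal evidence.
    core = [frames[i] for i in selector.select(frames, question)]
    # Stage 2: prompt the reasoning LVLM for step-by-step (CoT) reasoning.
    prompt = (f"Question: {question}\n"
              "Reason step by step over the provided key frames, then answer.")
    return reasoner.generate(images=core, prompt=prompt)
```

The design point this illustrates is the division of labor: a lightweight selector filters the video down to core frames, so the heavier reasoning LVLM only attends to compact multimodal evidence when producing its chain of thought.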