VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
November 22, 2024
Authors: Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, Si Liu
cs.AI
Abstract
The advancement of Large Vision Language Models (LVLMs) has significantly
improved multimodal understanding, yet challenges remain in video reasoning
tasks due to the scarcity of high-quality, large-scale datasets. Existing video
question-answering (VideoQA) datasets often rely on costly manual annotations
with insufficient granularity or automatic construction methods with redundant
frame-by-frame analysis, limiting their scalability and effectiveness for
complex reasoning. To address these challenges, we introduce VideoEspresso, a
novel dataset that features VideoQA pairs preserving essential spatial details
and temporal coherence, along with multimodal annotations of intermediate
reasoning steps. Our construction pipeline employs a semantic-aware method to
reduce redundancy, followed by generating QA pairs using GPT-4o. We further
develop video Chain-of-Thought (CoT) annotations to enrich reasoning processes,
guiding GPT-4o in extracting logical relationships from QA pairs and video
content. To exploit the potential of high-quality VideoQA pairs, we propose a
Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a
two-stage instruction-tuned reasoning LVLM. This framework adaptively
selects core frames and performs CoT reasoning using multimodal evidence.
Evaluated on our proposed benchmark with 14 tasks against 9 popular LVLMs, our
method outperforms existing baselines on most tasks, demonstrating superior
video reasoning capabilities. Our code and dataset will be released at:
https://github.com/hshjerry/VideoEspresso
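
The abstract describes the semantic-aware redundancy reduction only at a high level. As a rough illustration (not the authors' implementation), the sketch below keeps a frame only when its CLIP image embedding diverges enough from the most recently kept frame; the checkpoint name, similarity threshold, and the `select_core_frames` helper are all assumptions made for this example.

```python
# Hypothetical sketch of semantic-aware frame de-duplication: embed
# sampled frames and drop any frame whose embedding is nearly identical
# to the last kept frame. Model and threshold are illustrative only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def select_core_frames(frames: list[Image.Image], sim_threshold: float = 0.9) -> list[int]:
    """Return indices of frames kept after semantic de-duplication."""
    with torch.no_grad():
        inputs = processor(images=frames, return_tensors="pt")
        feats = model.get_image_features(**inputs)
    feats = torch.nn.functional.normalize(feats, dim=-1)

    kept = [0]  # always keep the first frame
    for i in range(1, len(frames)):
        # cosine similarity to the most recently kept frame
        sim = (feats[i] @ feats[kept[-1]]).item()
        if sim < sim_threshold:  # semantically new content -> keep it
            kept.append(i)
    return kept
```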
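Likewise, the video CoT annotation step guides GPT-4o to extract logical relationships from QA pairs and video content. A minimal sketch of such a call, assuming per-frame captions serve as the textual evidence and using an illustrative prompt (the paper's actual prompts and input format are not shown here):

```python
# Hypothetical sketch of the CoT annotation step: prompt GPT-4o to
# write the intermediate reasoning linking a QA pair to frame captions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def annotate_cot(question: str, answer: str, frame_captions: list[str]) -> str:
    evidence = "\n".join(f"Frame {i}: {c}" for i, c in enumerate(frame_captions))
    prompt = (
        "Given the video evidence below, write the intermediate reasoning "
        "steps that logically connect the question to the answer.\n\n"
        f"{evidence}\n\nQuestion: {question}\nAnswer: {answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```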