VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
November 22, 2024
Authors: Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, Si Liu
cs.AI
Abstract
The advancement of Large Vision Language Models (LVLMs) has significantly
improved multimodal understanding, yet challenges remain in video reasoning
tasks due to the scarcity of high-quality, large-scale datasets. Existing video
question-answering (VideoQA) datasets often rely on costly manual annotations
with insufficient granularity or automatic construction methods with redundant
frame-by-frame analysis, limiting their scalability and effectiveness for
complex reasoning. To address these challenges, we introduce VideoEspresso, a
novel dataset that features VideoQA pairs preserving essential spatial details
and temporal coherence, along with multimodal annotations of intermediate
reasoning steps. Our construction pipeline employs a semantic-aware method to
reduce redundancy, followed by generating QA pairs using GPT-4o. We further
develop video Chain-of-Thought (CoT) annotations to enrich reasoning processes,
guiding GPT-4o in extracting logical relationships from QA pairs and video
content. To exploit the potential of high-quality VideoQA pairs, we propose a
Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a
two-stage instruction-tuned reasoning LVLM. This framework adaptively
selects core frames and performs CoT reasoning using multimodal evidence.
Evaluated on our proposed benchmark with 14 tasks against 9 popular LVLMs, our
method outperforms existing baselines on most tasks, demonstrating superior
video reasoning capabilities. Our code and dataset will be released at:
https://github.com/hshjerry/VideoEspresso
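
The abstract describes the semantic-aware redundancy reduction only at a high level. As a rough illustration (not the authors' implementation), the sketch below keeps a frame only when its CLIP image embedding diverges enough from the most recently kept frame; the checkpoint name, similarity threshold, and the `select_core_frames` helper are all assumptions made for this example.

```python
# Hypothetical sketch of semantic-aware frame de-duplication: embed
# sampled frames and drop any frame whose embedding is nearly identical
# to the last kept frame. Model and threshold are illustrative only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def select_core_frames(frames: list[Image.Image], sim_threshold: float = 0.9) -> list[int]:
    """Return indices of frames kept after semantic de-duplication."""
    with torch.no_grad():
        inputs = processor(images=frames, return_tensors="pt")
        feats = model.get_image_features(**inputs)
    feats = torch.nn.functional.normalize(feats, dim=-1)

    kept = [0]  # always keep the first frame
    for i in range(1, len(frames)):
        # cosine similarity to the most recently kept frame
        sim = (feats[i] @ feats[kept[-1]]).item()
        if sim < sim_threshold:  # semantically new content -> keep it
            kept.append(i)
    return kept
```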
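Likewise, the video CoT annotation step guides GPT-4o to extract logical relationships from QA pairs and video content. A minimal sketch of such a call, assuming per-frame captions serve as the textual evidence and using an illustrative prompt (the paper's actual prompts and input format are not shown here):

```python
# Hypothetical sketch of the CoT annotation step: prompt GPT-4o to
# write the intermediate reasoning linking a QA pair to frame captions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def annotate_cot(question: str, answer: str, frame_captions: list[str]) -> str:
    evidence = "\n".join(f"Frame {i}: {c}" for i, c in enumerate(frame_captions))
    prompt = (
        "Given the video evidence below, write the intermediate reasoning "
        "steps that logically connect the question to the answer.\n\n"
        f"{evidence}\n\nQuestion: {question}\nAnswer: {answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```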