VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

November 22, 2024
Authors: Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, Si Liu
cs.AI

Abstract

The advancement of Large Vision Language Models (LVLMs) has significantly improved multimodal understanding, yet challenges remain in video reasoning tasks due to the scarcity of high-quality, large-scale datasets. Existing video question-answering (VideoQA) datasets often rely on costly manual annotations with insufficient granularity or automatic construction methods with redundant frame-by-frame analysis, limiting their scalability and effectiveness for complex reasoning. To address these challenges, we introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence, along with multimodal annotations of intermediate reasoning steps. Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o. We further develop video Chain-of-Thought (CoT) annotations to enrich reasoning processes, guiding GPT-4o in extracting logical relationships from QA pairs and video content. To exploit the potential of high-quality VideoQA pairs, we propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM. This framework adaptively selects core frames and performs CoT reasoning using multimodal evidence. Evaluated on our proposed benchmark with 14 tasks against 9 popular LVLMs, our method outperforms existing baselines on most tasks, demonstrating superior video reasoning capabilities. Our code and dataset will be released at: https://github.com/hshjerry/VideoEspresso
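The abstract does not specify how the semantic-aware redundancy reduction works. Purely as a hypothetical illustration of the general idea (not the authors' pipeline), the sketch below embeds each frame and greedily keeps only frames whose cosine similarity to every already-kept frame falls below a threshold; the function name, the threshold value, and the stand-in random embeddings are all assumptions for the example.

```python
import numpy as np

def select_core_frames(frame_embeddings: np.ndarray, sim_threshold: float = 0.95) -> list[int]:
    """Greedy near-duplicate pruning: keep a frame only if its cosine
    similarity to every already-kept frame is below `sim_threshold`.
    (Illustrative sketch only; not the method described in the paper.)
    """
    # Normalize rows so plain dot products equal cosine similarities.
    normed = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, emb in enumerate(normed):
        # Keep frame i only if it is dissimilar from all frames kept so far.
        if all(float(emb @ normed[j]) < sim_threshold for j in kept):
            kept.append(i)
    return kept

# Usage with stand-in embeddings; in practice these would come from a
# vision encoder (e.g., a CLIP-style model) applied to sampled frames.
rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 512))  # 100 frames, 512-dim embeddings
core = select_core_frames(frames)
print(f"kept {len(core)} of {len(frames)} frames")
```

Greedy thresholding like this is one simple way to trade off coverage against redundancy; the paper's actual Frame Selector is a learned component, so this sketch only conveys the frame-pruning intuition.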
