MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
December 6, 2024
Authors: Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, Xiang Yue
cs.AI
Abstract
Open-source multimodal large language models (MLLMs) have shown significant
potential in a broad range of multimodal tasks. However, their reasoning
capabilities remain constrained by existing instruction-tuning datasets, which
were predominantly repurposed from academic datasets such as VQA, AI2D, and
ChartQA. These datasets target simplistic tasks and provide only phrase-level
answers without any intermediate rationales. To address these challenges, we
introduce a scalable and cost-effective method to construct a large-scale
multimodal instruction-tuning dataset with rich intermediate rationales
designed to elicit chain-of-thought (CoT) reasoning. Using only open models, we create a dataset
containing 12M instruction-response pairs to cover diverse, reasoning-intensive
tasks with detailed and faithful rationales. Experiments demonstrate that
training MLLMs on this dataset significantly improves reasoning capabilities,
achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%),
MMMU-Pro (+7%), and MuirBench (+13.3%). Additionally, the model demonstrates
notable improvements of up to 4% on non-reasoning-based benchmarks. Ablation
studies further highlight the importance of key components, such as rewriting
and self-filtering, in the dataset construction process.
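To make the rewriting and self-filtering steps concrete, below is a minimal Python sketch of the kind of pipeline the abstract describes: an open model expands a phrase-level answer into a CoT rationale, and a second model pass keeps only rationales judged consistent with the reference answer. All names here (`generate`, `rewrite_with_rationale`, `self_filter`, `build_pair`) and the prompts are hypothetical illustrations under assumed behavior, not the paper's actual implementation.

```python
# Hypothetical sketch of a rewrite-and-self-filter loop; `generate` stands in
# for any open-model inference call (e.g. a vLLM or transformers pipeline).
# Prompts and pass/fail criteria below are illustrative, not from the paper.
from typing import Callable, Optional


def rewrite_with_rationale(question: str, short_answer: str,
                           generate: Callable[[str], str]) -> str:
    """Ask an open model to expand a phrase-level answer into a CoT rationale."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {short_answer}\n"
        "Rewrite the answer as step-by-step reasoning that ends with "
        "'Final answer: <answer>'."
    )
    return generate(prompt)


def self_filter(question: str, short_answer: str, rationale: str,
                generate: Callable[[str], str]) -> bool:
    """Keep a rewritten sample only if a model judges the rationale faithful."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {short_answer}\n"
        f"Candidate rationale: {rationale}\n"
        "Does the rationale reach the reference answer without errors? "
        "Reply YES or NO."
    )
    return generate(prompt).strip().upper().startswith("YES")


def build_pair(question: str, short_answer: str,
               generate: Callable[[str], str]) -> Optional[dict]:
    """Produce one instruction-response pair, or None if it fails the filter."""
    rationale = rewrite_with_rationale(question, short_answer, generate)
    if self_filter(question, short_answer, rationale, generate):
        return {"instruction": question, "response": rationale}
    return None
```

Run at scale over repurposed academic QA pairs, a loop like this yields instruction-response pairs with intermediate rationales while discarding rewrites whose reasoning drifts from the reference answer, which is the role the abstract attributes to rewriting and self-filtering.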