MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
December 6, 2024
Authors: Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, Xiang Yue
cs.AI
Abstract
Open-source multimodal large language models (MLLMs) have shown significant
potential in a broad range of multimodal tasks. However, their reasoning
capabilities remain constrained by existing instruction-tuning datasets, which
were predominantly repurposed from academic datasets such as VQA, AI2D, and
ChartQA. These datasets target simplistic tasks and provide only phrase-level
answers without any intermediate rationales. To address these challenges, we
introduce a scalable and cost-effective method to construct a large-scale
multimodal instruction-tuning dataset with rich intermediate rationales
designed to elicit CoT reasoning. Using only open models, we create a dataset
containing 12M instruction-response pairs to cover diverse, reasoning-intensive
tasks with detailed and faithful rationales. Experiments demonstrate that
training MLLMs on this dataset significantly improves reasoning capabilities,
achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%),
MMMU-Pro (+7%), and MuirBench (+13.3%). Additionally, the model demonstrates
notable improvements of up to 4% on non-reasoning-based benchmarks. Ablation
studies further highlight the importance of key components, such as rewriting
and self-filtering, in the dataset construction process.