

MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale

December 6, 2024
作者: Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, Xiang Yue
cs.AI

Abstract

Open-source multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. However, their reasoning capabilities remain constrained by existing instruction-tuning datasets, which were predominantly repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets target simplistic tasks and provide only phrase-level answers without any intermediate rationales. To address these challenges, we introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit CoT reasoning. Using only open models, we create a dataset containing 12M instruction-response pairs covering diverse, reasoning-intensive tasks with detailed and faithful rationales. Experiments demonstrate that training MLLMs on this dataset significantly improves reasoning capabilities, achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%). Additionally, the model demonstrates notable improvements of up to 4% on non-reasoning-based benchmarks. Ablation studies further highlight the importance of key components, such as rewriting and self-filtering, in the dataset construction process.
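The abstract highlights two steps in the data-construction pipeline: rewriting phrase-level answers into chain-of-thought rationales with an open model, and self-filtering the rewritten pairs for faithfulness. The sketch below illustrates that rewrite-then-filter loop in Python under stated assumptions: `generate` is a placeholder for any open MLLM inference call, and the prompts, function names, and yes/no consistency check are illustrative choices, not the paper's actual implementation.

```python
# Illustrative sketch of a rewrite + self-filter loop (not the paper's exact code).
# `generate` is assumed to wrap an open MLLM's text generation for a given sample.
from typing import Callable, Iterable

def rewrite_with_rationale(generate: Callable[[str], str],
                           question: str, short_answer: str) -> str:
    """Ask the model to expand a phrase-level answer into a step-by-step rationale."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {short_answer}\n"
        "Rewrite the answer as a detailed step-by-step rationale that ends "
        "with the final answer."
    )
    return generate(prompt)

def self_filter(generate: Callable[[str], str],
                question: str, short_answer: str, rationale: str) -> bool:
    """Use the model itself to check the rationale stays faithful to the reference answer."""
    prompt = (
        f"Question: {question}\n"
        f"Rationale: {rationale}\n"
        f"Does the rationale conclude with the answer '{short_answer}'? "
        "Reply yes or no."
    )
    return generate(prompt).strip().lower().startswith("yes")

def build_dataset(generate: Callable[[str], str],
                  samples: Iterable[tuple[str, str]]) -> list[dict]:
    """Rewrite each (question, short answer) pair and keep only self-consistent ones."""
    kept = []
    for question, short_answer in samples:
        rationale = rewrite_with_rationale(generate, question, short_answer)
        if self_filter(generate, question, short_answer, rationale):
            kept.append({"question": question, "response": rationale})
    return kept
```

In this reading, the self-filter discards rewritten responses whose conclusion drifts from the original ground-truth answer, which is one plausible way to keep rationales "detailed and faithful" as the abstract describes.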

