MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
December 6, 2024
Authors: Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, Xiang Yue
cs.AI
Abstract
Open-source multimodal large language models (MLLMs) have shown significant
potential in a broad range of multimodal tasks. However, their reasoning
capabilities remain constrained by existing instruction-tuning datasets, which
were predominantly repurposed from academic datasets such as VQA, AI2D, and
ChartQA. These datasets target simplistic tasks and provide only phrase-level
answers without any intermediate rationales. To address these challenges, we
introduce a scalable and cost-effective method to construct a large-scale
multimodal instruction-tuning dataset with rich intermediate rationales
designed to elicit CoT reasoning. Using only open models, we create a dataset
containing 12M instruction-response pairs to cover diverse, reasoning-intensive
tasks with detailed and faithful rationales. Experiments demonstrate that
training MLLMs on this dataset significantly improves reasoning capabilities,
achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%),
MMMU-Pro (+7%), and MuirBench (+13.3%). Additionally, the model demonstrates
notable improvements of up to 4% on non-reasoning-based benchmarks. Ablation
studies further highlight the importance of key components, such as rewriting
and self-filtering, in the dataset construction process.