MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

September 30, 2024
Authors: Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, Bowen Zhang, Yanghao Li, Sam Dodge, Keen You, Zhen Yang, Aleksei Timofeev, Mingze Xu, Hong-You Chen, Jean-Philippe Fauconnier, Zhengfeng Lai, Haoxuan You, Zirui Wang, Afshin Dehghan, Peter Grasch, Yinfei Yang
cs.AI

Abstract

We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. Building upon the MM1 architecture, MM1.5 adopts a data-centric approach to model training, systematically exploring the impact of diverse data mixtures across the entire model training lifecycle. This includes high-quality OCR data and synthetic captions for continual pre-training, as well as an optimized visual instruction-tuning data mixture for supervised fine-tuning. Our models range from 1B to 30B parameters, encompassing both dense and mixture-of-experts (MoE) variants, and demonstrate that careful data curation and training strategies can yield strong performance even at small scales (1B and 3B). Additionally, we introduce two specialized variants: MM1.5-Video, designed for video understanding, and MM1.5-UI, tailored for mobile UI understanding. Through extensive empirical studies and ablations, we provide detailed insights into the training processes and decisions that inform our final designs, offering valuable guidance for future research in MLLM development.
