MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
September 30, 2024
Authors: Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, Bowen Zhang, Yanghao Li, Sam Dodge, Keen You, Zhen Yang, Aleksei Timofeev, Mingze Xu, Hong-You Chen, Jean-Philippe Fauconnier, Zhengfeng Lai, Haoxuan You, Zirui Wang, Afshin Dehghan, Peter Grasch, Yinfei Yang
cs.AI
Abstract
We present MM1.5, a new family of multimodal large language models (MLLMs)
designed to enhance capabilities in text-rich image understanding, visual
referring and grounding, and multi-image reasoning. Building upon the MM1
architecture, MM1.5 adopts a data-centric approach to model training,
systematically exploring the impact of diverse data mixtures across the entire
model training lifecycle. This includes high-quality OCR data and synthetic
captions for continual pre-training, as well as an optimized visual
instruction-tuning data mixture for supervised fine-tuning. Our models range
from 1B to 30B parameters, encompassing both dense and mixture-of-experts (MoE)
variants, and demonstrate that careful data curation and training strategies
can yield strong performance even at small scales (1B and 3B). Additionally, we
introduce two specialized variants: MM1.5-Video, designed for video
understanding, and MM1.5-UI, tailored for mobile UI understanding. Through
extensive empirical studies and ablations, we provide detailed insights into
the training processes and decisions that inform our final designs, offering
valuable guidance for future research in MLLM development.