Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources
April 1, 2025
Authors: Weizhi Wang, Yu Tian, Linjie Yang, Heng Wang, Xifeng Yan
cs.AI
Abstract
The reproduction of state-of-the-art multimodal LLM pre-training faces barriers at every stage of the pipeline, including high-quality data filtering, multimodal data mixture strategies, sequence packing techniques, and training frameworks. We introduce Open-Qwen2VL, a fully open-source 2B-parameter Multimodal Large Language Model pre-trained efficiently on 29M image-text pairs using only 442 A100-40G GPU hours. Our approach employs low-to-high dynamic image resolution and multimodal sequence packing to significantly enhance pre-training efficiency.
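As a concrete illustration of multimodal sequence packing, the sketch below greedily packs variable-length interleaved image-text token sequences into fixed-length bins. The 4096-token context length, the function name, and the first-fit-decreasing strategy are illustrative assumptions, not the authors' released packing script.

```python
# Minimal sketch of multimodal sequence packing (illustrative only; the
# released packing scripts may differ). Each sample is a token-id list
# that already interleaves image-patch tokens with text tokens.
from typing import Iterable, List

MAX_LEN = 4096  # assumed context length, for illustration

def pack_sequences(samples: Iterable[List[int]],
                   max_len: int = MAX_LEN) -> List[List[int]]:
    """First-fit-decreasing packing of samples into fixed-size bins,
    reducing the padding waste of one-sample-per-sequence batching."""
    bins: List[List[int]] = []
    for seq in sorted(samples, key=len, reverse=True):
        seq = seq[:max_len]  # guard against oversized samples
        for b in bins:
            if len(b) + len(seq) <= max_len:
                b.extend(seq)  # append into the first bin that fits
                break
        else:
            bins.append(list(seq))  # open a new bin
    return bins
```

In practice each packed bin also records per-sample boundaries so the attention mask can prevent tokens from attending across packed samples.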
The training dataset was carefully curated using both MLLM-based filtering techniques (e.g., MLM-Filter) and conventional CLIP-based filtering methods, substantially improving data quality and training efficiency.
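For the conventional CLIP-based side of the filtering pipeline, a minimal sketch is shown below, assuming a Hugging Face CLIP checkpoint; the 0.28 cutoff is a commonly used LAION-style value, not the paper's reported threshold, and the MLLM-based MLM-Filter scoring is not reproduced here.

```python
# Hedged sketch of conventional CLIP-score filtering: keep an image-text
# pair only if the CLIP embedding similarity clears a cutoff.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)  # L2-normalize embeddings
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item()

def keep_pair(image: Image.Image, caption: str,
              threshold: float = 0.28) -> bool:
    # 0.28 is an assumed, commonly used cutoff, not the paper's setting
    return clip_score(image, caption) >= threshold
```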
The Open-Qwen2VL pre-training is conducted on academic-level 8xA100-40G GPUs at UCSB on 5B packed multimodal tokens, which is 0.36% of the 1.4T multimodal pre-training tokens of Qwen2-VL.
The final instruction-tuned Open-Qwen2VL outperforms the partially open state-of-the-art MLLM Qwen2-VL-2B on various multimodal benchmarks, including MMBench, SEEDBench, MMStar, and MathVista, indicating the remarkable training efficiency of Open-Qwen2VL.
We open-source all aspects of our work, including compute-efficient and data-efficient training details, data filtering methods, sequence packing scripts, pre-training data in WebDataset format, the FSDP-based training codebase, and both base and instruction-tuned model checkpoints.
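Since the pre-training data is released in WebDataset format, image-text pairs can be streamed shard by shard with the webdataset library; the shard pattern and sample keys below are hypothetical placeholders, not the actual released layout.

```python
# Hedged sketch of streaming image-text pairs from WebDataset shards.
# The shard pattern and sample keys are hypothetical; check the released
# data for the real layout.
import webdataset as wds

shards = "openqwen2vl-data/shard-{000000..000099}.tar"  # hypothetical

dataset = (
    wds.WebDataset(shards)
    .decode("pil")               # decode image bytes into PIL images
    .to_tuple("jpg;png", "txt")  # yield (image, caption) tuples
)

for image, caption in dataset:
    # feed each pair into the filtering / packing pipeline
    break
```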
We redefine "fully open" for multimodal LLMs as the complete release of: 1) the training codebase, 2) detailed data filtering techniques, and 3) all pre-training and supervised fine-tuning data used to develop the model.