Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data
October 24, 2024
Authors: Shuhao Gu, Jialing Zhang, Siyuan Zhou, Kevin Yu, Zhaohu Xing, Liangdong Wang, Zhou Cao, Jintao Jia, Zhuoyi Zhang, Yixuan Wang, Zhenchong Hu, Bo-Wen Zhang, Jijie Li, Dong Liang, Yingli Zhao, Yulong Ao, Yaoqi Liu, Fangxiang Feng, Guang Liu
cs.AI
Abstract
Vision-Language Models (VLMs) have recently made significant progress, but
the limited scale and quality of open-source instruction data hinder their
performance compared to closed-source models. In this work, we address this
limitation by introducing Infinity-MM, a large-scale multimodal instruction
dataset with 40 million samples, enhanced through rigorous quality filtering
and deduplication. We also propose a synthetic instruction generation method
based on open-source VLMs, using detailed image annotations and diverse
question generation. Using this data, we trained a 2-billion-parameter VLM,
Aquila-VL-2B, achieving state-of-the-art (SOTA) performance for models of
similar scale. This demonstrates that expanding instruction data and generating
synthetic data can significantly improve the performance of open-source models.
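
The abstract describes the data pipeline only at a high level. The sketch below is a minimal Python illustration of how such a pipeline might be structured: an open-source VLM first produces a detailed image annotation, diverse questions are then generated from that annotation and answered by the model, and duplicate question-answer pairs are filtered out. All names here (`vlm.describe`, `vlm.answer`, the template format) are hypothetical placeholders for illustration, not APIs from the paper, and the paper's actual filtering and generation steps are more involved.

```python
# Hypothetical sketch of a synthetic instruction-generation pipeline in the
# spirit of the abstract: detailed image annotation -> diverse questions ->
# answers, with hash-based deduplication as a simple quality filter.
import hashlib
from dataclasses import dataclass


@dataclass
class InstructionSample:
    image_path: str
    question: str
    answer: str


def dedup_key(sample: InstructionSample) -> str:
    """Hash normalized question+answer text so verbatim duplicates collapse."""
    text = (sample.question + "\n" + sample.answer).lower().strip()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def generate_samples(image_paths, vlm, question_templates, per_image=4):
    """Yield deduplicated QA samples for each image.

    `vlm` is assumed (hypothetically) to expose two methods:
      - vlm.describe(image_path) -> detailed caption string
      - vlm.answer(image_path, question) -> answer string
    Each template is a format string containing a "{caption}" placeholder.
    """
    seen = set()
    for path in image_paths:
        caption = vlm.describe(path)  # detailed image annotation
        for template in question_templates[:per_image]:
            question = template.format(caption=caption)
            answer = vlm.answer(path, question)
            sample = InstructionSample(path, question, answer)
            key = dedup_key(sample)
            if key in seen:  # quality step: drop duplicate QA pairs
                continue
            seen.add(key)
            yield sample
```

A template here might read, for example, "Given an image described as: {caption} -- what objects are visible?"; varying such templates is one simple way to obtain the "diverse question generation" the abstract refers to, though the paper's actual diversification strategy may differ.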