

Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data

October 24, 2024
Authors: Shuhao Gu, Jialing Zhang, Siyuan Zhou, Kevin Yu, Zhaohu Xing, Liangdong Wang, Zhou Cao, Jintao Jia, Zhuoyi Zhang, Yixuan Wang, Zhenchong Hu, Bo-Wen Zhang, Jijie Li, Dong Liang, Yingli Zhao, Yulong Ao, Yaoqi Liu, Fangxiang Feng, Guang Liu
cs.AI

Abstract
Vision-Language Models (VLMs) have recently made significant progress, but the limited scale and quality of open-source instruction data hinder their performance compared to closed-source models. In this work, we address this limitation by introducing Infinity-MM, a large-scale multimodal instruction dataset with 40 million samples, enhanced through rigorous quality filtering and deduplication. We also propose a synthetic instruction generation method based on open-source VLMs, using detailed image annotations and diverse question generation. Using this data, we trained a 2-billion-parameter VLM, Aquila-VL-2B, achieving state-of-the-art (SOTA) performance for models of similar scale. This demonstrates that expanding instruction data and generating synthetic data can significantly improve the performance of open-source models.
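The abstract highlights "rigorous quality filtering and deduplication" as a key step in building Infinity-MM. The paper's actual pipeline is not described here; as a minimal illustrative sketch, a single pass over instruction samples might combine a simple quality gate with exact-duplicate removal via content hashing (all field names and thresholds below are assumptions, not the authors' method):

```python
import hashlib

def dedup_and_filter(samples, min_len=8):
    """Hypothetical quality-filter + dedup pass over multimodal
    instruction samples. Field names ("image_id", "instruction") and
    the length threshold are illustrative assumptions only."""
    seen = set()
    kept = []
    for s in samples:
        # Quality gate: drop instructions that are too short to be useful.
        if len(s["instruction"].split()) < min_len:
            continue
        # Exact-duplicate removal via a content hash of image id + text.
        key = hashlib.sha256(
            (s["image_id"] + s["instruction"]).encode("utf-8")
        ).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        kept.append(s)
    return kept
```

A production pipeline at the 40-million-sample scale would likely use near-duplicate detection (e.g. MinHash or embedding similarity) and model-based quality scoring rather than these toy heuristics.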

