VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search
March 13, 2025
Authors: Yiming Jia, Jiachen Li, Xiang Yue, Bo Li, Ping Nie, Kai Zou, Wenhu Chen
cs.AI
Abstract
Vision-Language Models have made significant progress on many perception-focused tasks; however, their progress on reasoning-focused tasks appears to be limited by the lack of high-quality and diverse training data. In this work, we aim to address the scarcity of reasoning-focused multimodal datasets. We propose VisualWebInstruct, a novel approach that leverages search engines to create a diverse, high-quality dataset spanning multiple disciplines such as math, physics, finance, and chemistry. Starting with 30,000 meticulously selected seed images, we employ Google Image Search to identify websites containing similar images, then collect and process the HTML from over 700K unique URL sources. Through a pipeline of content extraction, filtering, and synthesis, we build a dataset of approximately 900K question-answer pairs, 40% of which are visual QA pairs and the rest text QA pairs. Models fine-tuned on VisualWebInstruct demonstrate significant performance gains: (1) training from Llava-OV-mid yields 10-20% absolute point gains across benchmarks, and (2) training from MAmmoTH-VL yields a 5% absolute gain. Our best model, MAmmoTH-VL2, achieves state-of-the-art performance within the 10B parameter class on MMMU-Pro-std (40.7%), MathVerse (42.6%), and DynaMath (55.7%). These remarkable results highlight the effectiveness of our dataset in enhancing VLMs' reasoning capabilities for complex multimodal tasks.
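To make the collection pipeline described in the abstract concrete, below is a minimal Python sketch of its stages: seed images, reverse image search, HTML collection, and QA extraction with deduplication over unique URLs. This is not the authors' implementation. The function names (`reverse_image_search`, `extract_qa_candidates`), the heading-based extraction heuristic, and the use of `requests`/`BeautifulSoup` are illustrative assumptions; Google exposes no official reverse-image-search API, so that step is left as a stub.

```python
"""Minimal sketch of a VisualWebInstruct-style collection pipeline.

All names and heuristics here are illustrative placeholders, not the
paper's actual method.
"""
from dataclasses import dataclass

import requests
from bs4 import BeautifulSoup


@dataclass
class QAPair:
    question: str
    answer: str
    image_url: str | None  # None for text-only QA pairs
    source_url: str


def reverse_image_search(seed_image_path: str) -> list[str]:
    """Placeholder for the Google Image Search step in the abstract.

    There is no official reverse-image-search API, and the paper's exact
    mechanism is not specified here, so this stub only marks the step.
    """
    raise NotImplementedError("plug in a reverse image search backend")


def fetch_html(url: str) -> str | None:
    """Download raw HTML for one candidate URL, skipping failures."""
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        return resp.text
    except requests.RequestException:
        return None


def extract_qa_candidates(html: str, url: str) -> list[QAPair]:
    """Toy extraction heuristic: treat a heading ending in '?' as a
    question and the next paragraph as its answer. The real pipeline
    uses far more sophisticated extraction, filtering, and synthesis."""
    soup = BeautifulSoup(html, "html.parser")
    pairs: list[QAPair] = []
    for heading in soup.find_all(["h2", "h3"]):
        para = heading.find_next("p")
        if para and heading.get_text(strip=True).endswith("?"):
            img = heading.find_next("img")
            pairs.append(QAPair(
                question=heading.get_text(strip=True),
                answer=para.get_text(strip=True),
                image_url=img["src"] if img and img.has_attr("src") else None,
                source_url=url,
            ))
    return pairs


def build_dataset(seed_images: list[str]) -> list[QAPair]:
    """End-to-end skeleton: seeds -> URLs -> HTML -> QA candidates."""
    dataset: list[QAPair] = []
    seen_urls: set[str] = set()
    for seed in seed_images:
        for url in reverse_image_search(seed):
            if url in seen_urls:
                continue  # the abstract stresses *unique* URL sources
            seen_urls.add(url)
            html = fetch_html(url)
            if html:
                dataset.extend(extract_qa_candidates(html, url))
    return dataset
```

The deduplication set mirrors the abstract's "over 700K unique URL sources": the same page can surface for many seed images, so URL-level dedup happens before any expensive extraction. The filtering and synthesis stages mentioned in the abstract (quality filtering, answer rewriting) would sit after `extract_qa_candidates` and are omitted here.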