Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation
February 20, 2025
Authors: Yue Yang, Ajay Patel, Matt Deitke, Tanmay Gupta, Luca Weihs, Andrew Head, Mark Yatskar, Chris Callison-Burch, Ranjay Krishna, Aniruddha Kembhavi, Christopher Clark
cs.AI
Abstract
Reasoning about images with rich text, such as charts and documents, is a
critical application of vision-language models (VLMs). However, VLMs often
struggle in these domains due to the scarcity of diverse text-rich
vision-language data. To address this challenge, we present CoSyn, a framework
that leverages the coding capabilities of text-only large language models
(LLMs) to automatically create synthetic text-rich multimodal data. Given input
text describing a target domain (e.g., "nutrition fact labels"), CoSyn prompts
an LLM to generate code (Python, HTML, LaTeX, etc.) for rendering synthetic
images. With the underlying code as textual representations of the synthetic
images, CoSyn can generate high-quality instruction-tuning data, again relying
on a text-only LLM. Using CoSyn, we constructed a dataset comprising 400K
images and 2.7M rows of vision-language instruction-tuning data. Comprehensive
experiments on seven benchmarks demonstrate that models trained on our
synthetic data achieve state-of-the-art performance among competitive
open-source models, including Llama 3.2, and surpass proprietary models such as
GPT-4V and Gemini 1.5 Flash. Furthermore, CoSyn can produce synthetic pointing
data, enabling VLMs to ground information within input images, showcasing its
potential for developing multimodal agents capable of acting in real-world
environments.
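To make the pipeline described in the abstract concrete, below is a minimal sketch of the three CoSyn steps as stated: prompt a text-only LLM for code that renders a synthetic text-rich image, execute that code to produce the image, then reuse the same code as a textual stand-in for the image when generating instruction-tuning data. The helper `call_llm` and all function names here are hypothetical placeholders for illustration, not the authors' released implementation.

```python
# Sketch of a CoSyn-style data generation loop (illustrative, not the paper's code).
import subprocess
import textwrap

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for any text-only LLM API call."""
    raise NotImplementedError("Plug in your LLM provider here.")

def generate_render_code(topic: str) -> str:
    # Step 1: ask the LLM for self-contained code that renders a
    # synthetic text-rich image in the target domain.
    prompt = textwrap.dedent(f"""
        Write a self-contained Python script using matplotlib that renders
        a realistic "{topic}" image and saves it to 'synthetic.png'.
        Return only the code.
    """)
    return call_llm(prompt)

def render_image(code: str, path: str = "render.py") -> None:
    # Step 2: execute the generated code to produce the synthetic image.
    with open(path, "w") as f:
        f.write(code)
    subprocess.run(["python", path], check=True, timeout=60)

def generate_qa(code: str) -> str:
    # Step 3: treat the code as a faithful textual representation of the
    # image and elicit instruction-tuning question-answer pairs from it.
    prompt = (
        "The following code renders an image. Without seeing the image, "
        "write five question-answer pairs about its visible content.\n\n" + code
    )
    return call_llm(prompt)

if __name__ == "__main__":
    code = generate_render_code("nutrition fact labels")
    render_image(code)
    print(generate_qa(code))
```

Because the rendering code fully determines the image, the question-answer pairs in step 3 can be generated by a text-only model with no vision capability, which is what lets this approach scale to hundreds of thousands of images.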