Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation

Abstract

Reasoning about images with rich text, such as charts and documents, is a critical application of vision-language models (VLMs). However, VLMs often struggle in these domains due to the scarcity of diverse text-rich vision-language data. To address this challenge, we present CoSyn, a framework that leverages the coding capabilities of text-only large language models (LLMs) to automatically create synthetic text-rich multimodal data. Given input text describing a target domain (e.g., "nutrition fact labels"), CoSyn prompts an LLM to generate code (Python, HTML, LaTeX, etc.) for rendering synthetic images. Because the underlying code serves as a textual representation of each synthetic image, CoSyn can then generate high-quality instruction-tuning data, again relying only on a text-only LLM. Using CoSyn, we constructed a dataset comprising 400K images and 2.7M rows of vision-language instruction-tuning data. Comprehensive experiments on seven benchmarks demonstrate that models trained on our synthetic data achieve state-of-the-art performance among competitive open-source models, including Llama 3.2, and surpass proprietary models such as GPT-4V and Gemini 1.5 Flash. Furthermore, CoSyn can produce synthetic pointing data, enabling VLMs to ground information within input images, showcasing its potential for developing multimodal agents capable of acting in real-world environments.
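
The two-stage flow described in the abstract (domain text to rendering code, then rendering code to instruction-tuning data) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the CoSyn implementation: the `LLM` callable type, the prompt wording, `SyntheticExample`, `synthesize`, and the canned `fake_llm` are all hypothetical names introduced here, and the step that actually renders the code into an image is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumption: an "LLM" is any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


@dataclass
class SyntheticExample:
    domain: str          # input text describing the target domain
    code: str            # rendering code returned by the LLM (HTML here)
    qa_pairs: List[str]  # instruction-tuning rows derived from the code


def build_code_prompt(domain: str, language: str = "HTML") -> str:
    # Hypothetical prompt wording; the paper's actual prompts are not given in the abstract.
    return f"Write {language} code that renders a realistic image of: {domain}."


def build_qa_prompt(code: str, n: int) -> str:
    # The code stands in for the image, so a text-only model can write questions about it.
    return (
        "Here is the code behind an image:\n"
        f"{code}\n"
        f"Write {n} question-answer pairs about the image, one per line."
    )


def synthesize(domain: str, llm: LLM, n_questions: int = 3) -> SyntheticExample:
    # Stage 1: ask the text-only LLM for rendering code.
    code = llm(build_code_prompt(domain))
    # (Rendering the code to an image would happen here; omitted in this sketch.)
    # Stage 2: ask the same text-only LLM for Q&A grounded in that code.
    raw = llm(build_qa_prompt(code, n_questions))
    qa_pairs = [line.strip() for line in raw.splitlines() if line.strip()]
    return SyntheticExample(domain=domain, code=code, qa_pairs=qa_pairs[:n_questions])


def fake_llm(prompt: str) -> str:
    # Canned stand-in so the sketch runs without any model or network access.
    if prompt.startswith("Write"):
        return "<table><tr><td>Calories</td><td>120</td></tr></table>"
    return "Q: How many calories? A: 120\nQ: What unit is used? A: kcal"


if __name__ == "__main__":
    example = synthesize("nutrition fact labels", fake_llm, n_questions=2)
    print(example.code)
    for row in example.qa_pairs:
        print(row)
```

The point of the sketch is that both stages only ever pass text to the model: the image is never shown to the LLM, and the code alone carries the information needed to write questions and answers.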
