A Token-level Text Image Foundation Model for Document Understanding
March 4, 2025
Authors: Tongkun Guan, Zining Wang, Pei Fu, Zhengtao Guo, Wei Shen, Kai Zhou, Tiezhu Yue, Chen Duan, Hao Sun, Qianyi Jiang, Junfeng Luo, Xiaokang Yang
cs.AI
Abstract
In recent years, general visual foundation models (VFMs) have witnessed
increasing adoption, particularly as image encoders for popular multi-modal
large language models (MLLMs). However, without semantically fine-grained
supervision, these models still encounter fundamental prediction errors in the
context of downstream text-image-related tasks, i.e., perception, understanding
and reasoning with images containing small and dense texts. To bridge this gap,
we develop TokenOCR, the first token-level visual foundation model specifically
tailored for text-image-related tasks, designed to support a variety of
traditional downstream applications. To facilitate the pretraining of TokenOCR,
we also devise a high-quality data production pipeline that constructs the
first token-level image-text dataset, TokenIT, comprising 20 million images and
1.8 billion token-mask pairs. Furthermore, leveraging this foundation with
exceptional image-as-text capability, we seamlessly replace previous VFMs with
TokenOCR to construct a document-level MLLM, TokenVL, for VQA-based document
understanding tasks. Finally, extensive experiments demonstrate the
effectiveness of TokenOCR and TokenVL. Code, datasets, and weights will be
available at https://token-family.github.io/TokenOCR_project.
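To make the dataset's unit of supervision concrete: a token-mask pair associates one text token with a binary segmentation mask marking where that token appears in the image. The sketch below is purely illustrative and not from the paper; the record layout and all field and function names are hypothetical assumptions.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class TokenMaskPair:
    """Hypothetical record for one token-mask pair (layout is illustrative)."""
    token: str          # a text token, e.g. a subword from the image's text
    mask: np.ndarray    # binary mask over the image, shape (H, W), values in {0, 1}


def masked_pixel_count(pair: TokenMaskPair) -> int:
    """Number of image pixels attributed to this token by its mask."""
    return int(pair.mask.sum())


# Example: a 4x4 image in which the token occupies a 2x2 region.
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 1
pair = TokenMaskPair(token="doc", mask=mask)
print(masked_pixel_count(pair))  # prints 4
```

Pairing supervision at the token level (rather than whole-image captions) is what the abstract refers to as "semantically fine-grained" supervision: each token gets its own spatial grounding.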