OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations
December 10, 2024
Authors: Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, Jin Shi, Fan Wu, Pei Chu, Minghao Liu, Zhenxiang Li, Chao Xu, Bo Zhang, Botian Shi, Zhongying Tu, Conghui He
cs.AI
Abstract
Document content extraction is crucial in computer vision, especially for
meeting the high-quality data needs of large language models (LLMs) and
retrieval-augmented generation (RAG) technologies. However, current document
parsing methods suffer from significant limitations in terms of diversity and
comprehensive evaluation. To address these challenges, we introduce
OmniDocBench, a novel multi-source benchmark designed to advance automated
document content extraction. OmniDocBench includes a meticulously curated and
annotated high-quality evaluation dataset comprising nine diverse document
types, such as academic papers, textbooks, and slides. Our benchmark
provides a flexible and comprehensive evaluation framework with 19 layout
category labels and 14 attribute labels, enabling multi-level assessments
across entire datasets, individual modules, or specific data types. Using
OmniDocBench, we perform an exhaustive comparative analysis of existing modular
pipelines and multimodal end-to-end methods, highlighting their limitations in
handling document diversity while ensuring a fair evaluation. OmniDocBench
establishes a robust, diverse, and fair evaluation standard for the document
content extraction field, offering crucial insights for future advancements and
fostering the development of document parsing technologies. The code and
dataset are available at https://github.com/opendatalab/OmniDocBench.