OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations

Abstract

Document content extraction is crucial in computer vision, especially for meeting the high-quality data needs of large language models (LLMs) and retrieval-augmented generation (RAG) technologies. However, current document parsing methods still face significant limitations in diversity and in comprehensive evaluation. To address these challenges, we introduce OmniDocBench, a novel multi-source benchmark designed to advance automated document content extraction. OmniDocBench includes a meticulously curated and annotated high-quality evaluation dataset comprising nine diverse document types, including academic papers, textbooks, and slides. Our benchmark provides a flexible and comprehensive evaluation framework with 19 layout category labels and 14 attribute labels, enabling multi-level assessment across the entire dataset, individual modules, or specific data types. Using OmniDocBench, we perform an exhaustive comparative analysis of existing modular pipelines and multimodal end-to-end methods, highlighting their limitations in handling document diversity while ensuring a fair evaluation. OmniDocBench establishes a robust, diverse, and fair evaluation standard for the document content extraction field, offering crucial insights for future advancements and fostering the development of document parsing technologies. The code and dataset are available at https://github.com/opendatalab/OmniDocBench.

