OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations

December 10, 2024
Authors: Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, Jin Shi, Fan Wu, Pei Chu, Minghao Liu, Zhenxiang Li, Chao Xu, Bo Zhang, Botian Shi, Zhongying Tu, Conghui He
cs.AI

Abstract

Document content extraction is crucial in computer vision, especially for meeting the high-quality data needs of large language models (LLMs) and retrieval-augmented generation (RAG) technologies. However, current document parsing methods suffer from significant limitations in terms of diversity and comprehensive evaluation. To address these challenges, we introduce OmniDocBench, a novel multi-source benchmark designed to advance automated document content extraction. OmniDocBench includes a meticulously curated and annotated high-quality evaluation dataset comprising nine diverse document types, such as academic papers, textbooks, and slides. Our benchmark provides a flexible and comprehensive evaluation framework with 19 layout category labels and 14 attribute labels, enabling multi-level assessments across entire datasets, individual modules, or specific data types. Using OmniDocBench, we perform an exhaustive comparative analysis of existing modular pipelines and multimodal end-to-end methods, highlighting their limitations in handling document diversity and ensuring fair evaluation. OmniDocBench establishes a robust, diverse, and fair evaluation standard for the document content extraction field, offering crucial insights for future advancements and fostering the development of document parsing technologies. The code and dataset are available at https://github.com/opendatalab/OmniDocBench.

