Éclair -- Extracting Content and Layout with Integrated Reading Order for Documents

February 6, 2025
Authors: Ilia Karmanov, Amala Sanjay Deshmukh, Lukas Voegtle, Philipp Fischer, Kateryna Chumachenko, Timo Roman, Jarno Seppänen, Jupinder Parmar, Joseph Jennings, Andrew Tao, Karan Sapra
cs.AI

Abstract

Optical Character Recognition (OCR) technology is widely used to extract text from images of documents, facilitating efficient digitization and data retrieval. However, merely extracting text is insufficient when dealing with complex documents. Fully comprehending such documents requires an understanding of their structure -- including formatting, formulas, tables, and the reading order of multiple blocks and columns across multiple pages -- as well as semantic information for detecting elements like footnotes and image captions. This comprehensive understanding is crucial for downstream tasks such as retrieval, document question answering, and data curation for training Large Language Models (LLMs) and Vision Language Models (VLMs). To address this, we introduce Éclair, a general-purpose text-extraction tool specifically designed to process a wide range of document types. Given an image, Éclair is able to extract formatted text in reading order, along with bounding boxes and their corresponding semantic classes. To thoroughly evaluate these novel capabilities, we introduce our diverse human-annotated benchmark for document-level OCR and semantic classification. Éclair achieves state-of-the-art accuracy on this benchmark, outperforming other methods across key metrics. Additionally, we evaluate Éclair on established benchmarks, demonstrating its versatility and strength across several evaluation standards.

