Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction

Abstract

Document parsing is essential for converting unstructured and semi-structured documents (such as contracts, academic papers, and invoices) into structured, machine-readable data. It extracts reliable structured data from unstructured inputs, which benefits numerous downstream applications. With recent advances in large language models in particular, document parsing plays an indispensable role in both knowledge base construction and training data generation. This survey presents a comprehensive review of the current state of document parsing, covering key methodologies that range from modular pipeline systems to end-to-end models driven by large vision-language models. Core components such as layout detection, content extraction (including text, tables, and mathematical expressions), and multi-modal data integration are examined in detail. The survey also discusses the challenges that modular document parsing systems and vision-language models face in handling complex layouts, integrating multiple modules, and recognizing high-density text. It emphasizes the importance of developing larger and more diverse datasets and outlines future research directions.
