MinerU: An Open-Source Solution for Precise Document Content Extraction

September 27, 2024
Authors: Bin Wang, Chao Xu, Xiaomeng Zhao, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Rui Xu, Kaiwen Liu, Yuan Qu, Fukai Shang, Bo Zhang, Liqun Wei, Zhihao Sui, Wei Li, Botian Shi, Yu Qiao, Dahua Lin, Conghui He
cs.AI

Abstract

Document content analysis has been a crucial research area in computer vision. Despite significant advancements in methods such as OCR, layout detection, and formula recognition, existing open-source solutions struggle to consistently deliver high-quality content extraction due to the diversity in document types and content. To address these challenges, we present MinerU, an open-source solution for high-precision document content extraction. MinerU leverages the sophisticated PDF-Extract-Kit models to extract content from diverse documents effectively and employs finely tuned preprocessing and postprocessing rules to ensure the accuracy of the final results. Experimental results demonstrate that MinerU consistently achieves high performance across various document types, significantly enhancing the quality and consistency of content extraction. The MinerU open-source project is available at https://github.com/opendatalab/MinerU.
