DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception

Abstract

Document layout analysis is crucial for real-world document understanding systems, but it faces a difficult trade-off between speed and accuracy: multimodal methods that use both text and visual features achieve higher accuracy but suffer from significant latency, whereas unimodal methods that rely solely on visual features process documents faster at the expense of accuracy. To address this dilemma, we introduce DocLayout-YOLO, a novel approach that improves accuracy while retaining the speed advantage through document-specific optimizations in both pre-training and model design. For robust document pre-training, we introduce the Mesh-candidate BestFit algorithm, which frames document synthesis as a two-dimensional bin packing problem and generates the large-scale, diverse DocSynth-300K dataset. Pre-training on DocSynth-300K significantly improves fine-tuning performance across various document types. For model optimization, we propose a Global-to-Local Controllable Receptive Module that better handles the multi-scale variation of document elements. Furthermore, to validate performance across different document types, we introduce a complex and challenging benchmark named DocStructBench. Extensive experiments on downstream datasets demonstrate that DocLayout-YOLO excels in both speed and accuracy. Code, data, and models are available at https://github.com/opendatalab/DocLayout-YOLO.
