OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation

Abstract

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external knowledge, which reduces hallucinations and incorporates up-to-date information without retraining. As an essential part of RAG, external knowledge bases are commonly built by extracting structured data from unstructured PDF documents using Optical Character Recognition (OCR). However, given the imperfect predictions of OCR and the inherently non-uniform representation of structured data, knowledge bases inevitably contain various kinds of OCR noise. In this paper, we introduce OHRBench, the first benchmark for understanding the cascading impact of OCR on RAG systems. OHRBench includes 350 carefully selected unstructured PDF documents from six real-world RAG application domains, along with Q&As derived from multimodal elements in the documents, challenging existing OCR solutions used for RAG. To better understand OCR's impact on RAG systems, we identify two primary types of OCR noise, Semantic Noise and Formatting Noise, and apply perturbations to generate a set of structured data with varying degrees of each noise type. Using OHRBench, we first conduct a comprehensive evaluation of current OCR solutions and reveal that none is competent for constructing high-quality knowledge bases for RAG systems. We then systematically evaluate the impact of these two noise types and demonstrate the vulnerability of RAG systems. Furthermore, we discuss the potential of employing Vision-Language Models (VLMs) without OCR in RAG systems. Code: https://github.com/opendatalab/OHR-Bench
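To make the distinction between the two noise types concrete, the following is a minimal illustrative sketch, not the benchmark's actual perturbation pipeline. It assumes that semantic noise can be approximated by swapping characters for visually similar ones, and that formatting noise can be approximated by adding redundant Markdown emphasis that leaves the words themselves intact. The function names, the confusion table, and the `rate` parameter are all hypothetical; a normalized edit distance is used here only as one plausible way to express the degree of noise.

```python
import random

# Hypothetical table of visually confusable characters (an assumption,
# not taken from OHRBench).
CONFUSIONS = {"0": "O", "O": "0", "1": "l", "l": "1", "5": "S", "S": "5"}


def add_semantic_noise(text: str, rate: float, seed: int = 0) -> str:
    """Swap confusable characters with probability `rate` (content changes)."""
    rng = random.Random(seed)
    return "".join(
        CONFUSIONS[ch] if ch in CONFUSIONS and rng.random() < rate else ch
        for ch in text
    )


def add_formatting_noise(text: str, rate: float, seed: int = 0) -> str:
    """Wrap words in redundant '**' markers with probability `rate`.

    The words themselves are untouched; only the markup differs.
    """
    rng = random.Random(seed)
    words = text.split(" ")
    return " ".join(
        f"**{w}**" if w and rng.random() < rate else w for w in words
    )


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length."""
    if not a and not b:
        return 0.0
    return edit_distance(a, b) / max(len(a), len(b))


clean = "Total revenue in 2015 was 100 million"
for rate in (0.0, 0.5, 1.0):
    noisy = add_semantic_noise(clean, rate)
    print(rate, round(normalized_edit_distance(clean, noisy), 3), noisy)
```

Under these assumptions, raising `rate` produces versions of the same passage with increasing distance from the clean text, which is the kind of graded input one would need to observe how retrieval and generation quality change with noise level.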
