OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation

Abstract

Retrieval-augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external knowledge to reduce hallucinations and incorporate up-to-date information without retraining. As an essential part of RAG, external knowledge bases are commonly built by extracting structured data from unstructured PDF documents using Optical Character Recognition (OCR). However, given the imperfect predictions of OCR and the inherently non-uniform representation of structured data, knowledge bases inevitably contain various kinds of OCR noise. In this paper, we introduce OHRBench, the first benchmark for understanding the cascading impact of OCR on RAG systems. OHRBench includes 350 carefully selected unstructured PDF documents from six real-world RAG application domains, along with Q&As derived from multimodal elements in the documents, challenging existing OCR solutions used for RAG. To better understand OCR's impact on RAG systems, we identify two primary types of OCR noise, Semantic Noise and Formatting Noise, and apply perturbation to generate a set of structured data with varying degrees of each noise type. Using OHRBench, we first conduct a comprehensive evaluation of current OCR solutions and reveal that none is competent for constructing high-quality knowledge bases for RAG systems. We then systematically evaluate the impact of these two noise types and demonstrate the vulnerability of RAG systems. Furthermore, we discuss the potential of employing Vision-Language Models (VLMs) without OCR in RAG systems. Code: https://github.com/opendatalab/OHR-Bench
