OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation
December 3, 2024
Authors: Junyuan Zhang, Qintong Zhang, Bin Wang, Linke Ouyang, Zichen Wen, Ying Li, Ka-Ho Chow, Conghui He, Wentao Zhang
cs.AI
Abstract
Retrieval-augmented Generation (RAG) enhances Large Language Models (LLMs) by
integrating external knowledge to reduce hallucinations and incorporate
up-to-date information without retraining. As an essential part of RAG,
external knowledge bases are commonly built by extracting structured data from
unstructured PDF documents using Optical Character Recognition (OCR). However,
given the imperfect prediction of OCR and the inherent non-uniform
representation of structured data, knowledge bases inevitably contain various
forms of OCR noise. In this paper, we introduce OHRBench, the first benchmark for
understanding the cascading impact of OCR on RAG systems. OHRBench includes 350
carefully selected unstructured PDF documents from six real-world RAG
application domains, along with Q&As derived from multimodal elements in
documents, challenging existing OCR solutions used for RAG. To better understand
OCR's impact on RAG systems, we identify two primary types of OCR noise,
Semantic Noise and Formatting Noise, and apply perturbations to generate a set of
structured data with varying degrees of each OCR noise. Using OHRBench, we
first conduct a comprehensive evaluation of current OCR solutions and reveal
that none is competent for constructing high-quality knowledge bases for RAG
systems. We then systematically evaluate the impact of these two noise types
and demonstrate the vulnerability of RAG systems. Furthermore, we discuss the
potential of employing Vision-Language Models (VLMs) without OCR in RAG
systems. Code: https://github.com/opendatalab/OHR-Bench
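
The idea of perturbing clean text to obtain structured data with varying degrees of noise can be illustrated with a small sketch. This is not the perturbation procedure used in OHRBench, which the abstract does not describe. The names `perturb` and `noise_level`, the character-substitution scheme, and the use of normalized edit distance as the noise measure are all assumptions made for illustration only.

```python
import random
import string


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def perturb(text: str, rate: float, seed: int = 0) -> str:
    """Hypothetical stand-in for semantic noise: replace each alphanumeric
    character with a random lowercase letter with probability `rate`."""
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    out = []
    for ch in text:
        if ch.isalnum() and rng.random() < rate:
            out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(ch)
    return "".join(out)


def noise_level(clean: str, noisy: str) -> float:
    """Edit distance normalized to [0, 1] by the longer string's length."""
    longest = max(len(clean), len(noisy))
    if longest == 0:
        return 0.0
    return edit_distance(clean, noisy) / longest


if __name__ == "__main__":
    clean = "Revenue grew 12% in 2023"
    for rate in (0.0, 0.1, 0.3, 0.5):
        noisy = perturb(clean, rate)
        print(f"rate={rate:.1f} noise={noise_level(clean, noisy):.2f} text={noisy!r}")
```

In a sketch like this, each perturbed copy of a document would be indexed as its own knowledge base, so that retrieval and answer quality can be compared across noise levels. Formatting noise would need a different perturbation, for example altering table or formula markup rather than characters.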