OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation

Abstract

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external knowledge, which reduces hallucinations and incorporates up-to-date information without retraining. As an essential part of RAG, external knowledge bases are commonly built by extracting structured data from unstructured PDF documents using Optical Character Recognition (OCR). However, because OCR predictions are imperfect and structured data has an inherently non-uniform representation, knowledge bases inevitably contain various forms of OCR noise. In this paper, we introduce OHRBench, the first benchmark for understanding the cascading impact of OCR on RAG systems. OHRBench includes 350 carefully selected unstructured PDF documents from six real-world RAG application domains, along with Q&As derived from multimodal elements in the documents, which challenge existing OCR solutions used for RAG. To better understand OCR's impact on RAG systems, we identify two primary types of OCR noise, Semantic Noise and Formatting Noise, and apply perturbation to generate a set of structured data with varying degrees of each noise type. Using OHRBench, we first conduct a comprehensive evaluation of current OCR solutions and reveal that none is competent for constructing high-quality knowledge bases for RAG systems. We then systematically evaluate the impact of these two noise types and demonstrate the vulnerability of RAG systems. Furthermore, we discuss the potential of employing Vision-Language Models (VLMs) without OCR in RAG systems. Code: https://github.com/opendatalab/OHR-Bench
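
The abstract names the two noise types and the use of perturbation at varying degrees, but it does not describe the procedure itself. As a rough illustration only, the sketch below reads Semantic Noise as character-level recognition errors and Formatting Noise as spurious markup, and applies each at a controllable rate. The confusion table, the function names, and the choice of bold markers are all assumptions made for this example; none of them is taken from OHRBench.

```python
import random

# Hypothetical OCR confusion pairs, chosen for illustration (not from the paper).
CONFUSIONS = {"l": "1", "O": "0", "m": "rn", "S": "5"}


def add_semantic_noise(text: str, rate: float, rng: random.Random) -> str:
    """Replace each confusable character with probability `rate`."""
    out = []
    for ch in text:
        if ch in CONFUSIONS and rng.random() < rate:
            out.append(CONFUSIONS[ch])
        else:
            out.append(ch)
    return "".join(out)


def add_formatting_noise(text: str, rate: float, rng: random.Random) -> str:
    """Wrap each space-separated token in stray bold markers with probability `rate`."""
    tokens = text.split(" ")
    return " ".join(
        f"**{tok}**" if tok and rng.random() < rate else tok for tok in tokens
    )


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    clean = "Oslo mail Service"
    for rate in (0.0, 0.5, 1.0):
        print(rate, add_semantic_noise(clean, rate, rng))
        print(rate, add_formatting_noise(clean, rate, rng))
```

Sweeping `rate` from 0 to 1 gives a family of progressively noisier copies of the same text, which is the general shape of experiment the abstract describes: one clean source, several noise levels, and a downstream RAG evaluation on each.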
