
OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation

December 3, 2024
Authors: Junyuan Zhang, Qintong Zhang, Bin Wang, Linke Ouyang, Zichen Wen, Ying Li, Ka-Ho Chow, Conghui He, Wentao Zhang
cs.AI

Abstract

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external knowledge to reduce hallucinations and incorporate up-to-date information without retraining. As an essential part of RAG, external knowledge bases are commonly built by extracting structured data from unstructured PDF documents using Optical Character Recognition (OCR). However, given the imperfect prediction of OCR and the inherent non-uniform representation of structured data, knowledge bases inevitably contain various OCR noises. In this paper, we introduce OHRBench, the first benchmark for understanding the cascading impact of OCR on RAG systems. OHRBench includes 350 carefully selected unstructured PDF documents from six real-world RAG application domains, along with Q&As derived from multimodal elements in documents, challenging existing OCR solutions used for RAG. To better understand OCR's impact on RAG systems, we identify two primary types of OCR noise: Semantic Noise and Formatting Noise, and apply perturbation to generate a set of structured data with varying degrees of each OCR noise. Using OHRBench, we first conduct a comprehensive evaluation of current OCR solutions and reveal that none is competent for constructing high-quality knowledge bases for RAG systems. We then systematically evaluate the impact of these two noise types and demonstrate the vulnerability of RAG systems. Furthermore, we discuss the potential of employing Vision-Language Models (VLMs) without OCR in RAG systems. Code: https://github.com/opendatalab/OHR-Bench
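The pipeline the abstract describes, in which OCR output is chunked into a knowledge base and then queried by a retriever, can be sketched minimally. The chunking scheme and the bag-of-words retriever below are illustrative stand-ins, not the paper's implementation:

```python
from collections import Counter

def build_knowledge_base(ocr_outputs, chunk_size=50):
    """Split OCR'd document text into fixed-size word chunks (a toy
    stand-in for real structured extraction from PDFs)."""
    chunks = []
    for text in ocr_outputs:
        words = text.split()
        for i in range(0, len(words), chunk_size):
            chunks.append(" ".join(words[i:i + chunk_size]))
    return chunks

def retrieve(query, chunks, k=1):
    """Rank chunks by bag-of-words overlap with the query; a real RAG
    system would use BM25 or dense embeddings instead."""
    q = Counter(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: sum((q & Counter(c.lower().split())).values()),
        reverse=True,
    )
    return ranked[:k]

kb = build_knowledge_base(["OCR noise degrades retrieval quality in RAG systems."])
print(retrieve("OCR noise", kb))
```

OCR noise enters this pipeline at the `build_knowledge_base` stage: corrupted text propagates into every chunk, which is exactly the cascading effect the benchmark measures.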

Summary

AI-Generated Summary

Paper Overview

The paper introduces OHRBench, a benchmark for evaluating OCR's impact on Retrieval-Augmented Generation (RAG) systems. It identifies Semantic and Formatting Noise types, perturbs data accordingly, and evaluates OCR solutions' performance. The study highlights the vulnerability of RAG systems to OCR noise and discusses the potential of Vision-Language Models (VLMs) in RAG applications.

Core Contribution

  • Introduction of OHRBench benchmark for assessing OCR impact on RAG systems.
  • Identification and perturbation of Semantic and Formatting Noise types.
  • Comprehensive evaluation of OCR solutions and their impact on RAG performance.
  • Analysis of the vulnerability of RAG systems to OCR noise.
  • Discussion on the potential of Vision-Language Models in RAG applications.

Research Context

The paper addresses the gap in existing benchmarks by focusing on OCR's cascading impact on RAG systems. It explores the effects of Semantic and Formatting Noise on RAG components, evaluates OCR solutions comprehensively, and discusses the potential of VLMs in enhancing RAG performance.

Keywords

OCR, Retrieval-Augmented Generation (RAG), OHRBench, Semantic Noise, Formatting Noise, Vision-Language Models (VLMs), Benchmarking, Multimodal Elements, Knowledge Bases

Background

The study is motivated by the lack of benchmarks measuring how OCR noise affects RAG components. By perturbing data with Semantic and Formatting Noise, it assesses whether current OCR solutions are competent for constructing high-quality knowledge bases for RAG systems.

Research Gap

Existing literature lacks benchmarks that specifically evaluate OCR's impact on RAG systems.

Technical Challenges

Challenges include identifying and perturbing Semantic and Formatting Noise types in OCR data.

Prior Approaches

Existing solutions have not comprehensively evaluated OCR's impact on constructing knowledge bases for RAG systems.

Methodology

The research methodology involves perturbing data with Semantic and Formatting Noise to evaluate OCR solutions' performance and their impact on RAG systems.

Theoretical Foundation

The study is based on assessing OCR noise effects on RAG components and systems.

Technical Architecture

Data perturbation involves introducing Semantic and Formatting Noise to mimic OCR errors.
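A minimal sketch of such perturbation, assuming simple character confusions for Semantic Noise and spurious markup for Formatting Noise; the paper's actual perturbation operators are not reproduced here:

```python
import random

# Illustrative OCR-style character confusions (visually similar glyphs);
# chosen for this sketch, not taken from the paper.
CONFUSIONS = {"l": "1", "O": "0", "e": "c", "S": "5"}

def add_semantic_noise(text, rate=0.2, seed=0):
    """Randomly swap characters for visually similar ones, mimicking
    OCR misrecognition (Semantic Noise)."""
    rng = random.Random(seed)
    return "".join(
        CONFUSIONS[ch] if ch in CONFUSIONS and rng.random() < rate else ch
        for ch in text
    )

def add_formatting_noise(text, rate=0.2, seed=0):
    """Wrap random words in spurious markup, mimicking structure-recognition
    errors in OCR output (Formatting Noise)."""
    rng = random.Random(seed)
    return " ".join(
        f"**{w}**" if rng.random() < rate else w for w in text.split()
    )
```

Varying `rate` yields the graded noise levels the benchmark uses to probe how quickly retrieval and generation quality degrade.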

Implementation Details

Specific algorithms and tools are used to generate perturbed data and evaluate OCR solutions.

Innovation Points

The study innovates by introducing OHRBench, identifying OCR noise types, evaluating OCR solutions comprehensively, and analyzing OCR noise's impact on RAG systems.

Experimental Validation

The experimental validation assesses OCR solutions' performance on RAG systems using perturbed data with Semantic and Formatting Noise.

Setup

Data perturbation involves introducing varying levels of Semantic and Formatting Noise.

Metrics

Evaluation metrics include LCS@1 and LCS@5 (retrieval) alongside EM, F1, EM@1, and F1@1 (answer quality) for assessing OCR solutions' downstream performance.
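A plausible reading of these metrics is that LCS@k scores retrieval by longest-common-subsequence overlap between retrieved and gold text, while EM and F1 score generated answers. Minimal sketches of the underlying computations follow; the paper's exact normalization may differ:

```python
from collections import Counter

def lcs_len(a, b):
    """Longest common subsequence length of two token lists (standard DP)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (
                dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
            )
    return dp[-1][-1]

def exact_match(pred, gold):
    """EM: 1.0 iff the normalized prediction equals the gold answer."""
    return float(pred.strip().lower() == gold.strip().lower())

def f1_score(pred, gold):
    """Token-level F1 between predicted and gold answer strings."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)
```

The @k variants would apply these scores over the top-k retrieved chunks or the answers conditioned on them.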

Results

Results show the impact of Semantic and Formatting Noise on RAG systems and OCR solutions' competency.

Comparative Analysis

The study compares OCR solutions' performance across different domains and evaluates the impact of noise on RAG components.

Impact and Implications

The research findings have implications for improving OCR solutions for RAG systems and highlight the potential of Vision-Language Models in enhancing RAG performance.

Key Findings

The study reveals the vulnerability of RAG systems to OCR noise and the need for improved OCR solutions.

Limitations

Current OCR solutions exhibit performance loss in RAG applications, indicating the need for advancements.

Future Directions

Future research can focus on developing OCR solutions resilient to Semantic and Formatting Noise.

Practical Significance

The study's findings can lead to the development of more robust OCR systems for RAG applications.
