

MoC: Mixtures of Text Chunking Learners for Retrieval-Augmented Generation System

March 12, 2025
Authors: Jihao Zhao, Zhiyuan Ji, Zhaoxin Fan, Hanyu Wang, Simin Niu, Bo Tang, Feiyu Xiong, Zhiyu Li
cs.AI

Abstract

Retrieval-Augmented Generation (RAG), while serving as a viable complement to large language models (LLMs), often overlooks the crucial aspect of text chunking within its pipeline. This paper initially introduces a dual-metric evaluation method, comprising Boundary Clarity and Chunk Stickiness, to enable the direct quantification of chunking quality. Leveraging this assessment method, we highlight the inherent limitations of traditional and semantic chunking in handling complex contextual nuances, thereby substantiating the necessity of integrating LLMs into the chunking process. To address the inherent trade-off between computational efficiency and chunking precision in LLM-based approaches, we devise the granularity-aware Mixture-of-Chunkers (MoC) framework, which consists of a three-stage processing mechanism. Notably, our objective is to guide the chunker towards generating a structured list of chunking regular expressions, which are subsequently employed to extract chunks from the original text. Extensive experiments demonstrate that both our proposed metrics and the MoC framework effectively address the challenges of the chunking task, revealing the chunking kernel while enhancing the performance of the RAG system.
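The abstract describes chunkers that emit a structured list of regular expressions, which are then applied to the source text to extract chunks. A minimal sketch of that extraction step might look like the following; the function name, the sequential left-to-right matching strategy, and the sample patterns are illustrative assumptions, not the paper's actual implementation (the MoC chunkers that generate the pattern list are not reproduced here).

```python
import re

def extract_chunks(text, patterns):
    """Apply a list of chunking regexes in order: each pattern is
    matched at the start of the remaining text, and its match is
    emitted as the next chunk. (Illustrative sketch only; the
    pattern-generation stage is the learned part of MoC.)"""
    chunks = []
    remaining = text
    for pattern in patterns:
        m = re.match(pattern, remaining, flags=re.DOTALL)
        if m is None:
            continue  # skip patterns that fail on the remaining text
        chunks.append(m.group(0))
        remaining = remaining[m.end():]
    if remaining:
        chunks.append(remaining)  # keep any unmatched tail as a final chunk
    return chunks

doc = "Intro. Background. Method details. Results."
patterns = [r".*?Background\.", r".*?details\."]  # hypothetical chunker output
print(extract_chunks(doc, patterns))
```

Extracting chunks via regexes over the original text, rather than asking the LLM to rewrite the chunks verbatim, keeps the output anchored to the source and avoids paraphrase drift.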

