Meta-Chunking: Learning Efficient Text Segmentation via Logical Perception

Abstract

Retrieval-Augmented Generation (RAG) is a viable complement to large language models (LLMs), but RAG pipelines often overlook text chunking, a step that directly affects the quality of knowledge-intensive tasks. This paper introduces the concept of Meta-Chunking, a granularity between the sentence and the paragraph: a meta-chunk is a collection of sentences within a paragraph that have deep linguistic logical connections. To implement Meta-Chunking, we designed two LLM-based strategies: Margin Sampling Chunking and Perplexity Chunking. The former uses an LLM to perform binary classification on whether consecutive sentences should be segmented, and decides based on the probability difference obtained from margin sampling. The latter identifies text chunk boundaries by analyzing the characteristics of the perplexity distribution. Additionally, considering the inherent complexity of different texts, we propose a strategy that combines Meta-Chunking with dynamic merging to balance fine-grained and coarse-grained text chunking. Experiments conducted on eleven datasets demonstrate that Meta-Chunking can more efficiently improve the performance of RAG-based single-hop and multi-hop question answering. For instance, on the 2WikiMultihopQA dataset, it outperforms similarity chunking by 1.32 while consuming only 45.8% of the time. Our code is available at https://github.com/IAAR-Shanghai/Meta-Chunking.
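The abstract names the two strategies and the merging step without giving their decision rules, so the following Python sketch is only one plausible reading, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the thresholds, the rule that a chunk ends at a local minimum of the per-sentence perplexity sequence, and the character budget used for merging. No LLM is called; the probabilities and perplexities are passed in as plain numbers.

```python
from typing import List


def margin_split(p_split: float, p_keep: float, threshold: float = 0.0) -> bool:
    """Hypothetical margin rule: split when P(split) - P(keep) exceeds threshold.

    p_split and p_keep stand in for the probabilities a model would assign to
    the two answers of the binary "should these sentences be separated?" question.
    """
    return (p_split - p_keep) > threshold


def perplexity_boundaries(ppl: List[float], threshold: float = 0.0) -> List[int]:
    """Return indices i such that a chunk ends after sentence i.

    Assumption (not stated in the abstract): a boundary is a local minimum of
    the perplexity sequence whose rise to at least one neighbor exceeds threshold.
    """
    cuts = []
    for i in range(1, len(ppl) - 1):
        left = ppl[i - 1] - ppl[i]
        right = ppl[i + 1] - ppl[i]
        if left >= 0 and right >= 0 and max(left, right) > threshold:
            cuts.append(i)
    return cuts


def split_at(sentences: List[str], cuts: List[int]) -> List[List[str]]:
    """Cut the sentence list after each index in cuts (cuts must be ascending)."""
    chunks, start = [], 0
    for i in cuts:
        chunks.append(sentences[start:i + 1])
        start = i + 1
    if start < len(sentences):
        chunks.append(sentences[start:])
    return chunks


def merge_chunks(chunks: List[List[str]], max_chars: int) -> List[List[str]]:
    """Greedily merge adjacent chunks while the merged text fits in max_chars.

    A simple stand-in for "dynamic merging"; the budget is a hypothetical knob.
    """
    merged: List[List[str]] = []
    for chunk in chunks:
        size = sum(len(s) for s in chunk)
        if merged and sum(len(s) for s in merged[-1]) + size <= max_chars:
            merged[-1] = merged[-1] + chunk
        else:
            merged.append(list(chunk))
    return merged
```

In this sketch a larger threshold yields fewer, coarser chunks, and the merge step then recombines small neighbors up to the chosen budget, which is one way to trade off fine-grained against coarse-grained chunking.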
