Meta-Chunking: Learning Efficient Text Segmentation via Logical Perception

Abstract

Retrieval-Augmented Generation (RAG) is a viable complement to large language models (LLMs), but its pipeline often overlooks text chunking, a step that affects the quality of knowledge-intensive tasks. This paper introduces Meta-Chunking, a granularity between the sentence and the paragraph: a set of sentences within a paragraph that share deep linguistic logical connections. To implement Meta-Chunking, we design two LLM-based strategies, Margin Sampling Chunking and Perplexity Chunking. The former uses an LLM to make a binary decision on whether consecutive sentences should be split, based on the probability difference obtained from margin sampling. The latter identifies chunk boundaries by analyzing the characteristics of the perplexity distribution. Because texts differ in inherent complexity, we also propose combining Meta-Chunking with dynamic merging to balance fine-grained and coarse-grained chunking. Experiments on eleven datasets show that Meta-Chunking improves the performance of RAG-based single-hop and multi-hop question answering more efficiently. For instance, on the 2WikiMultihopQA dataset it outperforms similarity chunking by 1.32 while using only 45.8% of the time. Our code is available at https://github.com/IAAR-Shanghai/Meta-Chunking.
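
The two strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository linked above for that). It assumes the LLM has already produced the split/keep probabilities and the per-sentence perplexities, and every function name, the threshold parameter, and the rule "a sentence whose perplexity is a local minimum ends a chunk" are hypothetical choices made here for illustration.

```python
from typing import List


def margin_split(p_split: float, p_keep: float, threshold: float = 0.0) -> bool:
    """Margin-style decision (assumed form): split when the probability of
    'split' exceeds the probability of 'keep' by more than `threshold`."""
    return (p_split - p_keep) > threshold


def perplexity_boundaries(ppl: List[float], threshold: float = 0.0) -> List[int]:
    """Return indices whose perplexity is a local minimum, i.e. both
    neighbours are higher by more than `threshold` (assumed criterion)."""
    cuts = []
    for i in range(1, len(ppl) - 1):
        lower_than_left = ppl[i - 1] - ppl[i] > threshold
        lower_than_right = ppl[i + 1] - ppl[i] > threshold
        if lower_than_left and lower_than_right:
            cuts.append(i)
    return cuts


def chunk(sentences: List[str], cuts: List[int]) -> List[List[str]]:
    """Group sentences into chunks; each cut index closes a chunk (inclusive)."""
    chunks, start = [], 0
    for i in cuts:
        chunks.append(sentences[start:i + 1])
        start = i + 1
    if start < len(sentences):
        chunks.append(sentences[start:])  # trailing sentences form the last chunk
    return chunks


# Hypothetical per-sentence perplexities for six sentences.
example_ppl = [3.0, 1.0, 2.5, 2.0, 0.5, 4.0]
example_cuts = perplexity_boundaries(example_ppl, threshold=0.4)
example_chunks = chunk(list("abcdef"), example_cuts)
```

In this sketch a larger threshold yields fewer cuts and therefore coarser chunks. That is one simple way to trade off fine-grained against coarse-grained segmentation, but it is separate from the dynamic merging step the abstract mentions.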
