LLM×MapReduce: Simplified Long-Sequence Processing using Large Language Models

Abstract

Enlarging the context window of large language models (LLMs) has become a crucial research area, particularly for applications involving extremely long texts. In this work, we propose a novel training-free framework for processing long texts that uses a divide-and-conquer strategy to achieve comprehensive document understanding. The proposed LLM×MapReduce framework splits the entire document into several chunks for LLMs to read and then aggregates the intermediate answers to produce the final output. The main challenge for divide-and-conquer long-text processing frameworks is the risk of losing essential long-range information when the document is split, which can lead the model to produce incomplete or incorrect answers based on the segmented texts. Disrupted long-range information falls into two categories: inter-chunk dependency and inter-chunk conflict. We design a structured information protocol to better cope with inter-chunk dependency and an in-context confidence calibration mechanism to resolve inter-chunk conflicts. Experimental results demonstrate that LLM×MapReduce can outperform representative open-source and commercial long-context LLMs, and that it is applicable to several different models.
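The split, read, and aggregate flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the record fields (`extracted_info`, `answer`, `confidence`), the function names, and the `toy_reader` stand-in for an LLM call are all hypothetical, and the reduce step below simply keeps the answer with the highest self-reported confidence, which is a simplification of the in-context confidence calibration mechanism mentioned above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ChunkResult:
    """Hypothetical structured record passed from the map stage to the reduce stage."""
    extracted_info: str  # facts from the chunk that are relevant to the question
    answer: str          # intermediate answer based on this chunk alone
    confidence: float    # self-reported confidence in [0, 1]


def split_into_chunks(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def map_stage(
    chunks: List[str],
    question: str,
    read_chunk: Callable[[str, str], ChunkResult],
) -> List[ChunkResult]:
    """Apply the reader (an LLM call in practice) to every chunk independently."""
    return [read_chunk(chunk, question) for chunk in chunks]


def reduce_stage(results: List[ChunkResult]) -> ChunkResult:
    """Aggregate intermediate results into one final result.

    Extracted information from all chunks is kept so that facts spread over
    several chunks stay available; conflicting answers are resolved by
    keeping the one with the highest confidence (an assumed, simplified rule).
    """
    if not results:
        raise ValueError("no chunk results to aggregate")
    best = max(results, key=lambda r: r.confidence)
    merged_info = " ".join(r.extracted_info for r in results if r.extracted_info)
    return ChunkResult(merged_info, best.answer, best.confidence)


def toy_reader(chunk: str, question: str) -> ChunkResult:
    """Stand-in for an LLM call: answers only if the chunk mentions '1969'."""
    if "1969" in chunk:
        return ChunkResult(chunk.strip(), "1969", 0.9)
    return ChunkResult("", "NO INFORMATION", 0.1)
```

In a real pipeline, `toy_reader` would be replaced by a prompt to an LLM that returns the same structured fields, and the final result would come from `reduce_stage(map_stage(chunks, question, reader))`. Carrying the extracted information forward alongside each answer is what lets the aggregation step reason over facts that were separated by chunk boundaries.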
