APE: Faster and Longer Context-Augmented Generation via Adaptive Parallel Encoding

Abstract

Context-augmented generation (CAG) techniques, including RAG and ICL, must efficiently combine multiple contexts to generate responses to user queries. Feeding these contexts directly as a single sequence imposes a considerable computational burden, because the combined contexts are re-encoded for every request. To address this, we explore the promising potential of parallel encoding, which independently pre-computes and caches each context's KV states. This approach allows cached states to be loaded directly during inference and accommodates more contexts by reusing positions across contexts. However, because of misalignments in the attention distribution, applying parallel encoding directly causes a significant performance drop. To enable effective and efficient CAG, we propose Adaptive Parallel Encoding (APE), which introduces a shared prefix, an attention temperature, and a scaling factor to align the distribution of parallel encoding with that of sequential encoding. Results on RAG and ICL tasks show that APE can preserve 98% and 93% of sequential encoding performance using the same inputs while outperforming parallel encoding by 3.6% and 7.9%, respectively. It also scales to many-shot CAG, effectively encoding hundreds of contexts in parallel. Efficiency evaluation shows that APE can achieve an end-to-end 4.5× speedup by reducing prefilling time by 28× for a 128K-length context.
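To make the idea of attending over independently cached contexts more concrete, here is a minimal single-head NumPy sketch. It is an illustration under assumptions, not the method as specified by the authors: the abstract does not say where the attention temperature and scaling factor enter, so this sketch assumes the temperature divides the logits of cached context tokens and the scaling factor multiplies their un-normalized softmax mass. The function name `merged_attention` and all parameter names are hypothetical, and positional encodings and the shared prefix are omitted.

```python
import numpy as np

def merged_attention(q, ctx_keys, ctx_values, own_keys, own_values,
                     temperature=1.0, scale=1.0):
    """Attend from one query vector over cached contexts plus the query's own tokens.

    ctx_keys / ctx_values: lists of (n_i, d) arrays, one pair per context,
    each assumed to have been encoded and cached independently.
    own_keys / own_values: (m, d) arrays for the tokens of the current request.
    temperature, scale: ASSUMED placement (see lead-in), not taken from the source.
    """
    d = q.shape[-1]
    # Loading cached states amounts to concatenating them; nothing is re-encoded.
    k_ctx = np.concatenate(ctx_keys, axis=0)
    v_ctx = np.concatenate(ctx_values, axis=0)

    # Assumption: temperature rescales only the context logits.
    ctx_logits = (k_ctx @ q) / (np.sqrt(d) * temperature)
    own_logits = (own_keys @ q) / np.sqrt(d)

    # Subtract a shared maximum for numerical stability.
    m = max(ctx_logits.max(), own_logits.max())
    # Assumption: the scaling factor reweights the total context mass.
    ctx_w = scale * np.exp(ctx_logits - m)
    own_w = np.exp(own_logits - m)

    denom = ctx_w.sum() + own_w.sum()
    return (ctx_w @ v_ctx + own_w @ own_values) / denom
```

With `temperature=1.0` and `scale=1.0` this reduces to ordinary softmax attention over the concatenated keys, which is the unadjusted parallel-encoding baseline in this simplified setting. The two extra parameters are the knobs that, per the abstract, are meant to bring the attention distribution closer to that of sequential encoding.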
