TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval
February 28, 2025
Authors: Chien-Yu Lin, Keisuke Kamahori, Yiyu Liu, Xiaoxiang Shi, Madhav Kashyap, Yile Gu, Rulin Shao, Zihao Ye, Kan Zhu, Stephanie Wang, Arvind Krishnamurthy, Rohan Kadekodi, Luis Ceze, Baris Kasikci
cs.AI
Abstract
Retrieval-augmented generation (RAG) extends large language models (LLMs)
with external data sources to enhance factual correctness and domain coverage.
Modern RAG pipelines rely on large datastores, leading to system challenges in
latency-sensitive deployments, especially when limited GPU memory is available.
To address these challenges, we propose TeleRAG, an efficient inference system
that reduces RAG latency with minimal GPU memory requirements. The core
innovation of TeleRAG is lookahead retrieval, a prefetching mechanism that
anticipates required data and transfers it from CPU to GPU in parallel with LLM
generation. By leveraging the modularity of RAG pipelines, the inverted file
index (IVF) search algorithm, and the similarity between queries, TeleRAG
optimally overlaps data movement and computation. Experimental results show
that TeleRAG reduces end-to-end RAG inference latency by up to 1.72x on average
compared to state-of-the-art systems, enabling faster, more memory-efficient
deployments of advanced RAG applications.
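The abstract describes lookahead retrieval only at a high level. The minimal Python sketch below illustrates the general idea under stated assumptions: an IVF index whose centroids are resident on the GPU while cluster vectors live in CPU RAM, an early "draft" query embedding available mid-generation, and a background transfer that overlaps with LLM decoding. All names here (predict_clusters, prefetch, gpu_cache, draft_emb) are illustrative inventions, not TeleRAG's actual API, and the GPU cache and copies are simulated with plain Python objects.

```python
import threading
import numpy as np

# Hypothetical IVF datastore: coarse centroids fit on the GPU; the much
# larger per-cluster vector lists stay in CPU RAM.
rng = np.random.default_rng(0)
DIM, N_CLUSTERS, VECS_PER_CLUSTER = 128, 64, 256
centroids = rng.standard_normal((N_CLUSTERS, DIM)).astype(np.float32)
cpu_clusters = {c: rng.standard_normal((VECS_PER_CLUSTER, DIM)).astype(np.float32)
                for c in range(N_CLUSTERS)}
gpu_cache = {}  # stands in for GPU memory in this sketch

def predict_clusters(query_emb, nprobe=8):
    """IVF coarse search: pick the nprobe clusters nearest to the query."""
    scores = centroids @ query_emb
    return np.argsort(-scores)[:nprobe]

def prefetch(cluster_ids):
    """Copy predicted clusters from CPU to the (simulated) GPU cache."""
    for c in cluster_ids:
        gpu_cache.setdefault(c, cpu_clusters[c])  # stands in for an async H2D copy

# Lookahead retrieval (sketch): while the LLM is still generating, use an
# early, approximate query embedding to guess which clusters the final IVF
# search will probe, and move them to the GPU concurrently with decoding.
draft_emb = rng.standard_normal(DIM).astype(np.float32)  # from partial generation
worker = threading.Thread(target=prefetch, args=(predict_clusters(draft_emb),))
worker.start()
# ... LLM generation proceeds here, overlapping with the data transfer ...
worker.join()

# Final query embedding is close to the draft, so most probed clusters
# are already resident and need no blocking CPU-to-GPU copy.
final_emb = draft_emb + 0.05 * rng.standard_normal(DIM).astype(np.float32)
probe = predict_clusters(final_emb)
hits = [c for c in probe if c in gpu_cache]
print(f"prefetched {len(hits)}/{len(probe)} probed clusters")
```

In a real deployment the background copy would be an asynchronous device transfer on a side stream rather than a Python thread, and the draft embedding would come from the query as it takes shape during generation; the sketch only shows why similarity between the predicted and final queries lets data movement hide behind LLM compute.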