

Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning

March 6, 2025
Authors: Giulio Corallo, Orion Weller, Fabio Petroni, Paolo Papotti
cs.AI

Abstract

Incorporating external knowledge in large language models (LLMs) enhances their utility across diverse applications, but existing methods have trade-offs. Retrieval-Augmented Generation (RAG) fetches evidence via similarity search, but key information may fall outside top-ranked results. Long-context models can process multiple documents but are computationally expensive and limited by context window size. Inspired by students condensing study material for open-book exams, we propose task-aware key-value (KV) cache compression, which compresses external knowledge in a zero- or few-shot setup. This enables LLMs to reason efficiently over a compacted representation of all relevant information. Experiments show our approach outperforms both RAG and task-agnostic compression methods. On LongBench v2, it improves accuracy by up to 7 absolute points over RAG with a 30x compression rate, while reducing inference latency from 0.43s to 0.16s. A synthetic dataset highlights that RAG performs well when sparse evidence suffices, whereas task-aware compression is superior for broad knowledge tasks.
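The abstract describes the idea only at a high level. As a rough illustration (not the authors' implementation), the Python sketch below scores each cached key/value position by how much attention the task prompt pays to it and keeps only the top fraction, so that later reasoning runs over a much smaller cache. The function name compress_kv_cache, the single-head tensor shapes, and the fixed 30x keep ratio are assumptions made for this example.

```python
# Minimal sketch of task-aware KV cache pruning (illustrative only).
import torch

def compress_kv_cache(keys, values, task_queries, compression_rate=30):
    """Keep roughly the 1/compression_rate most task-relevant KV positions.

    keys, values:  [num_tokens, head_dim] cached states for the knowledge corpus
    task_queries:  [num_query_tokens, head_dim] query states from the task prompt
    """
    d = keys.size(-1)
    # Attention of the task prompt over the cached knowledge tokens.
    scores = torch.softmax(task_queries @ keys.T / d**0.5, dim=-1)   # [q, n]
    importance = scores.sum(dim=0)                                   # [n]
    keep = max(1, keys.size(0) // compression_rate)
    # Keep the most important positions, preserving their original order.
    idx = importance.topk(keep).indices.sort().values
    return keys[idx], values[idx]

# Toy usage: 3,000 cached knowledge tokens compressed ~30x to 100 tokens.
keys, values = torch.randn(3000, 64), torch.randn(3000, 64)
task_queries = torch.randn(16, 64)
k_small, v_small = compress_kv_cache(keys, values, task_queries)
print(k_small.shape)  # torch.Size([100, 64])
```

In this hypothetical setup the task prompt plays the role of the "exam question" from the open-book analogy: it decides which parts of the condensed material are worth keeping before inference begins.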
