More Documents, Same Length: Isolating the Challenge of Multiple Documents in RAG
March 6, 2025
Authors: Shahar Levy, Nir Mazor, Lihi Shalmon, Michael Hassid, Gabriel Stanovsky
cs.AI
Abstract
Retrieval-augmented generation (RAG) provides LLMs with relevant documents. Although previous studies noted that retrieving many documents can degrade performance, they did not isolate how the quantity of documents affects performance while controlling for context length. We evaluate various language models on custom datasets derived from a multi-hop QA task. We keep the context length and position of relevant information constant while varying the number of documents, and find that increasing the document count in RAG settings poses significant challenges for LLMs. Additionally, our results indicate that processing multiple documents is a separate challenge from handling long contexts. We also make the datasets and code available: https://github.com/shaharl6000/MoreDocsSameLen.
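The core manipulation described in the abstract is to vary the number of documents while holding the total context length and the position of the relevant (gold) documents fixed. The sketch below illustrates one way such controlled contexts could be assembled; it is a minimal illustration only, and the `Document` class, `build_context` function, and all parameter names are hypothetical assumptions rather than the authors' released code (which is available at the GitHub link above).

```python
# Illustrative sketch of a controlled RAG context builder: same target length,
# different document counts. Names and budgeting strategy are assumptions.
from dataclasses import dataclass
from typing import List


@dataclass
class Document:
    title: str
    text: str


def build_context(
    gold_docs: List[Document],        # documents containing the answer hops
    distractor_docs: List[Document],  # retrieved but irrelevant documents
    num_docs: int,                    # total documents in this condition
    target_length: int,               # fixed context budget (here: characters)
) -> str:
    """Assemble a context with `num_docs` documents whose combined length is
    roughly `target_length`, keeping gold documents at fixed positions."""
    assert num_docs >= len(gold_docs)
    docs = gold_docs + distractor_docs[: num_docs - len(gold_docs)]

    # Split the length budget evenly across documents, so the overall context
    # length stays (approximately) constant as the document count varies.
    per_doc = target_length // num_docs
    parts = []
    for doc in docs:
        body = doc.text[:per_doc].ljust(per_doc)  # crude fixed-size slot
        parts.append(f"Title: {doc.title}\n{body}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    gold = [Document("Hop 1", "Relevant fact A ..."), Document("Hop 2", "Relevant fact B ...")]
    noise = [Document(f"Distractor {i}", "Unrelated text ... " * 50) for i in range(20)]
    # Same length budget, different document counts: the variable under study.
    for n in (2, 5, 10):
        ctx = build_context(gold, noise, num_docs=n, target_length=3000)
        print(n, len(ctx))
```

In this toy setup, only `num_docs` changes across conditions, so any performance difference can be attributed to the number of documents rather than to context length or answer position.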