FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents
April 17, 2025
Authors: Nandan Thakur, Jimmy Lin, Sam Havens, Michael Carbin, Omar Khattab, Andrew Drozdov
cs.AI
Abstract
We introduce FreshStack, a reusable framework for automatically building
information retrieval (IR) evaluation benchmarks from community-asked questions
and answers. FreshStack conducts the following steps: (1) automatic corpus
collection from code and technical documentation, (2) nugget generation from
community-asked questions and answers, and (3) nugget-level support, retrieving
documents using a fusion of retrieval techniques and hybrid architectures. We
use FreshStack to build five datasets on fast-growing, recent, and niche topics
to ensure the tasks are sufficiently challenging. On FreshStack, existing
retrieval models, when applied out-of-the-box, significantly underperform
oracle approaches on all five topics, denoting plenty of headroom to improve IR
quality. In addition, we identify cases where rerankers do not clearly improve
first-stage retrieval accuracy (two out of five topics). We hope that
FreshStack will facilitate future work toward constructing realistic, scalable,
and uncontaminated IR and RAG evaluation benchmarks. FreshStack datasets are
available at: https://fresh-stack.github.io.
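The abstract does not specify which fusion method FreshStack applies at the nugget-support stage, only that it combines multiple retrieval techniques and hybrid architectures. As an illustration of what such fusion can look like, the sketch below shows reciprocal rank fusion (RRF), one common way to merge rankings from several retrievers (e.g., a lexical and a dense model). All function and variable names (reciprocal_rank_fusion, bm25_hits, dense_hits) are hypothetical and not taken from FreshStack itself.

```python
# Minimal sketch of reciprocal rank fusion (RRF): an illustrative example of
# fusing ranked lists from multiple retrievers, not the method FreshStack
# necessarily uses. The constant k=60 follows common RRF practice.
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    ranked_lists: list of rankings, each ordered best-first.
    Returns document IDs sorted by their fused RRF score (highest first).
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical usage: fuse a lexical and a dense ranking for one question.
bm25_hits = ["doc3", "doc1", "doc7"]
dense_hits = ["doc1", "doc9", "doc3"]
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))
# doc1 and doc3 rise to the top because both retrievers rank them highly.
```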