FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents
April 17, 2025
Authors: Nandan Thakur, Jimmy Lin, Sam Havens, Michael Carbin, Omar Khattab, Andrew Drozdov
cs.AI
Abstract
We introduce FreshStack, a reusable framework for automatically building
information retrieval (IR) evaluation benchmarks from community-asked questions
and answers. FreshStack conducts the following steps: (1) automatic corpus
collection from code and technical documentation, (2) nugget generation from
community-asked questions and answers, and (3) nugget-level support, retrieving
documents using a fusion of retrieval techniques and hybrid architectures. We
use FreshStack to build five datasets on fast-growing, recent, and niche topics
to ensure the tasks are sufficiently challenging. On FreshStack, existing
retrieval models, when applied out-of-the-box, significantly underperform
oracle approaches on all five topics, denoting plenty of headroom to improve IR
quality. In addition, we identify cases where rerankers do not clearly improve
first-stage retrieval accuracy (two out of five topics). We hope that
FreshStack will facilitate future work toward constructing realistic, scalable,
and uncontaminated IR and RAG evaluation benchmarks. FreshStack datasets are
available at: https://fresh-stack.github.io.
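
Step (3) in the abstract pools candidate documents using a fusion of retrieval techniques and hybrid architectures. As an illustration only, the minimal sketch below shows reciprocal rank fusion (RRF), one common way to merge ranked lists from heterogeneous retrievers such as lexical and dense models; the function, retriever names, and document IDs are hypothetical and are not taken from the paper.

from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    rankings: list of lists, each an ordered list of doc IDs from one
              retriever (e.g., BM25, a dense model, a code-aware model).
    k:        smoothing constant from the standard RRF formulation.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Higher fused score ranks first.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical example: three retrievers return partially overlapping results.
bm25_hits  = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_b", "doc_d", "doc_a"]
code_hits  = ["doc_c", "doc_b", "doc_e"]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits, code_hits])
print(fused)  # doc_b tends to rank first because all three retrievers return it

In this sketch, documents retrieved by multiple systems accumulate higher fused scores, which is why fusion over diverse retrievers can widen candidate coverage before any nugget-level support judgments are made.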