Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation
September 19, 2024
Authors: Satyapriya Krishna, Kalpesh Krishna, Anhad Mohananey, Steven Schwarcz, Adam Stambler, Shyam Upadhyay, Manaal Faruqui
cs.AI
Abstract
Large Language Models (LLMs) have demonstrated significant performance
improvements across various cognitive tasks. An emerging application is using
LLMs to enhance retrieval-augmented generation (RAG) capabilities. These
systems require LLMs to understand user queries, retrieve relevant information,
and synthesize coherent and accurate responses. Given the increasing real-world
deployment of such systems, comprehensive evaluation becomes crucial. To this
end, we propose FRAMES (Factuality, Retrieval, And reasoning MEasurement Set),
a high-quality evaluation dataset designed to test LLMs' ability to provide
factual responses, assess retrieval capabilities, and evaluate the reasoning
required to generate final answers. While previous work has provided datasets
and benchmarks to evaluate these abilities in isolation, FRAMES offers a
unified framework that provides a clearer picture of LLM performance in
end-to-end RAG scenarios. Our dataset comprises challenging multi-hop questions
that require the integration of information from multiple sources. We present
baseline results demonstrating that even state-of-the-art LLMs struggle with
this task, achieving 0.40 accuracy with no retrieval. The accuracy is
significantly improved with our proposed multi-step retrieval pipeline,
achieving an accuracy of 0.66 (>50% improvement). We hope our work will help
bridge evaluation gaps and assist in developing more robust and capable RAG
systems.
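The abstract names a multi-step retrieval pipeline but does not spell out its mechanics. As a rough illustration of what such a pipeline can look like for multi-hop questions, here is a minimal Python sketch of iterative retrieve-then-reason. The `retrieve` and `generate` callables, the prompt wording, and the `max_steps` cap are illustrative assumptions, not the paper's actual implementation.

```python
from typing import Callable, List

def multi_step_rag(
    question: str,
    retrieve: Callable[[str], List[str]],  # query -> top-k passages (assumed interface)
    generate: Callable[[str], str],        # prompt -> LLM completion (assumed interface)
    max_steps: int = 4,                    # hypothetical cap on retrieval rounds
) -> str:
    """Iteratively gather evidence for a multi-hop question, then answer it."""
    evidence: List[str] = []
    query = question
    for _ in range(max_steps):
        # Accumulate passages for the current sub-query.
        evidence.extend(retrieve(query))
        context = "\n".join(evidence)
        # Ask the model for the next search query, or DONE if evidence suffices.
        query = generate(
            f"Question: {question}\nEvidence so far:\n{context}\n"
            "If more information is needed, write the next search query; "
            "otherwise write DONE."
        ).strip()
        if query.upper() == "DONE":
            break
    context = "\n".join(evidence)
    # Final synthesis step: answer using only the retrieved evidence.
    return generate(
        f"Using only the evidence below, answer the question.\n"
        f"Evidence:\n{context}\nQuestion: {question}\nAnswer:"
    )
```

The intuition behind iterating rather than retrieving once is that in a multi-hop question, the answer to one hop (e.g., identifying an entity) determines what the next retrieval query should be, which a single up-front search cannot anticipate.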