

Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation

September 19, 2024
作者: Satyapriya Krishna, Kalpesh Krishna, Anhad Mohananey, Steven Schwarcz, Adam Stambler, Shyam Upadhyay, Manaal Faruqui
cs.AI

Abstract

Large Language Models (LLMs) have demonstrated significant performance improvements across various cognitive tasks. An emerging application is using LLMs to enhance retrieval-augmented generation (RAG) capabilities. These systems require LLMs to understand user queries, retrieve relevant information, and synthesize coherent and accurate responses. Given the increasing real-world deployment of such systems, comprehensive evaluation becomes crucial. To this end, we propose FRAMES (Factuality, Retrieval, And reasoning MEasurement Set), a high-quality evaluation dataset designed to test LLMs' ability to provide factual responses, assess retrieval capabilities, and evaluate the reasoning required to generate final answers. While previous work has provided datasets and benchmarks to evaluate these abilities in isolation, FRAMES offers a unified framework that provides a clearer picture of LLM performance in end-to-end RAG scenarios. Our dataset comprises challenging multi-hop questions that require the integration of information from multiple sources. We present baseline results demonstrating that even state-of-the-art LLMs struggle with this task, achieving 0.40 accuracy with no retrieval. The accuracy is significantly improved with our proposed multi-step retrieval pipeline, achieving an accuracy of 0.66 (>50% improvement). We hope our work will help bridge evaluation gaps and assist in developing more robust and capable RAG systems.
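The multi-step retrieval pipeline the abstract describes (iteratively retrieving evidence for a multi-hop question before synthesizing an answer) can be sketched roughly as below. This is a hypothetical illustration, not the paper's implementation: the toy word-overlap retriever, the `retrieve`/`multi_step_rag` names, and the tiny corpus are all invented for demonstration; in the real pipeline an LLM would reformulate the query between retrieval rounds and generate the final answer.

```python
def retrieve(query, corpus, top_k=1):
    """Toy lexical retriever: rank documents by word overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def multi_step_rag(question, corpus, steps=2):
    """Run several retrieval rounds, folding gathered evidence back into
    the query so later rounds can reach documents a single-shot query
    would miss (the multi-hop case)."""
    evidence = []
    query = question
    for _ in range(steps):
        for hit in retrieve(query, corpus):
            if hit not in evidence:
                evidence.append(hit)
        # A real pipeline would have an LLM rewrite the query from the
        # evidence so far; here we simply append the evidence text.
        query = question + " " + " ".join(evidence)
    return evidence

corpus = [
    "Paris is the capital of France.",
    "France borders Spain and Germany.",
    "The Eiffel Tower is in Paris.",
]
print(multi_step_rag("What city is the Eiffel Tower in?", corpus))
```

The final answer would then be generated by conditioning the LLM on the question plus the accumulated evidence; FRAMES scores that end-to-end output for factuality, retrieval, and reasoning jointly.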

