OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs
November 21, 2024
作者: Akari Asai, Jacqueline He, Rulin Shao, Weijia Shi, Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini, Sergey Feldman, Mike D'arcy, David Wadden, Matt Latzke, Minyang Tian, Pan Ji, Shengyan Liu, Hao Tong, Bohao Wu, Yanyu Xiong, Luke Zettlemoyer, Graham Neubig, Dan Weld, Doug Downey, Wen-tau Yih, Pang Wei Koh, Hannaneh Hajishirzi
cs.AI
Abstract
Scientific progress depends on researchers' ability to synthesize the growing
body of literature. Can large language models (LMs) assist scientists in this
task? We introduce OpenScholar, a specialized retrieval-augmented LM that
answers scientific queries by identifying relevant passages from 45 million
open-access papers and synthesizing citation-backed responses. To evaluate
OpenScholar, we develop ScholarQABench, the first large-scale multi-domain
benchmark for literature search, comprising 2,967 expert-written queries and
208 long-form answers across computer science, physics, neuroscience, and
biomedicine. On ScholarQABench, OpenScholar-8B outperforms GPT-4o by 5% and
PaperQA2 by 7% in correctness, despite being a smaller, open model. While
GPT-4o hallucinates citations 78% to 90% of the time, OpenScholar achieves
citation accuracy on par with human experts. OpenScholar's datastore,
retriever, and self-feedback inference loop also improve off-the-shelf LMs:
for instance, OpenScholar-GPT4o improves GPT-4o's correctness by 12%. In human
evaluations, experts preferred OpenScholar-8B and OpenScholar-GPT4o responses
over expert-written ones 51% and 70% of the time, respectively, compared to
GPT-4o's 32%. We open-source all of our code, models, datastore, data, and a
public demo.