OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs

Abstract

Scientific progress depends on researchers' ability to synthesize the growing body of literature. Can large language models (LMs) assist scientists in this task? We introduce OpenScholar, a specialized retrieval-augmented LM that answers scientific queries by identifying relevant passages from 45 million open-access papers and synthesizing citation-backed responses. To evaluate OpenScholar, we develop ScholarQABench, the first large-scale multi-domain benchmark for literature search, comprising 2,967 expert-written queries and 208 long-form answers across computer science, physics, neuroscience, and biomedicine. On ScholarQABench, OpenScholar-8B outperforms GPT-4o by 5% and PaperQA2 by 7% in correctness, despite being a smaller, open model. While GPT-4o hallucinates citations 78 to 90% of the time, OpenScholar achieves citation accuracy on par with human experts. OpenScholar's datastore, retriever, and self-feedback inference loop also improve off-the-shelf LMs: for instance, OpenScholar-GPT4o improves GPT-4o's correctness by 12%. In human evaluations, experts preferred OpenScholar-8B and OpenScholar-GPT4o responses over expert-written ones 51% and 70% of the time, respectively, compared to 32% for GPT-4o. We open-source all of our code, models, datastore, and data, along with a public demo.
