OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs
November 21, 2024
Authors: Akari Asai, Jacqueline He, Rulin Shao, Weijia Shi, Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini, Sergey Feldman, Mike D'arcy, David Wadden, Matt Latzke, Minyang Tian, Pan Ji, Shengyan Liu, Hao Tong, Bohao Wu, Yanyu Xiong, Luke Zettlemoyer, Graham Neubig, Dan Weld, Doug Downey, Wen-tau Yih, Pang Wei Koh, Hannaneh Hajishirzi
cs.AI
Abstract
Scientific progress depends on researchers' ability to synthesize the growing
body of literature. Can large language models (LMs) assist scientists in this
task? We introduce OpenScholar, a specialized retrieval-augmented LM that
answers scientific queries by identifying relevant passages from 45 million
open-access papers and synthesizing citation-backed responses. To evaluate
OpenScholar, we develop ScholarQABench, the first large-scale multi-domain
benchmark for literature search, comprising 2,967 expert-written queries and
208 long-form answers across computer science, physics, neuroscience, and
biomedicine. On ScholarQABench, OpenScholar-8B outperforms GPT-4o by 5% and
PaperQA2 by 7% in correctness, despite being a smaller, open model. While GPT-4o
hallucinates citations 78 to 90% of the time, OpenScholar achieves citation
accuracy on par with human experts. OpenScholar's datastore, retriever, and
self-feedback inference loop also improve off-the-shelf LMs: for instance,
OpenScholar-GPT4o improves GPT-4o's correctness by 12%. In human evaluations,
experts preferred OpenScholar-8B and OpenScholar-GPT4o responses over
expert-written ones 51% and 70% of the time, respectively, compared to GPT-4o's
32%. We open-source all of our code, models, datastore, data, and a public demo.