OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs

November 21, 2024
作者: Akari Asai, Jacqueline He, Rulin Shao, Weijia Shi, Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini, Sergey Feldman, Mike D'arcy, David Wadden, Matt Latzke, Minyang Tian, Pan Ji, Shengyan Liu, Hao Tong, Bohao Wu, Yanyu Xiong, Luke Zettlemoyer, Graham Neubig, Dan Weld, Doug Downey, Wen-tau Yih, Pang Wei Koh, Hannaneh Hajishirzi
cs.AI

Abstract

Scientific progress depends on researchers' ability to synthesize the growing body of literature. Can large language models (LMs) assist scientists in this task? We introduce OpenScholar, a specialized retrieval-augmented LM that answers scientific queries by identifying relevant passages from 45 million open-access papers and synthesizing citation-backed responses. To evaluate OpenScholar, we develop ScholarQABench, the first large-scale multi-domain benchmark for literature search, comprising 2,967 expert-written queries and 208 long-form answers across computer science, physics, neuroscience, and biomedicine. On ScholarQABench, OpenScholar-8B outperforms GPT-4o by 5% and PaperQA2 by 7% in correctness, despite being a smaller, open model. While GPT-4o hallucinates citations 78 to 90% of the time, OpenScholar achieves citation accuracy on par with human experts. OpenScholar's datastore, retriever, and self-feedback inference loop also improve off-the-shelf LMs: for instance, OpenScholar-GPT4o improves GPT-4o's correctness by 12%. In human evaluations, experts preferred OpenScholar-8B and OpenScholar-GPT4o responses over expert-written ones 51% and 70% of the time, respectively, compared to GPT-4o's 32%. We open-source all of our code, models, datastore, data, and a public demo.
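The pipeline the abstract describes (retrieve passages from a datastore, draft a citation-backed answer, then refine it via a self-feedback inference loop) can be sketched roughly as below. This is a minimal illustration only: the function names, the word-overlap retriever, and the trivial critic are all placeholders, not the OpenScholar implementation.

```python
def retrieve(query, datastore, k=5):
    """Placeholder retriever: rank passages by word overlap with the query."""
    q_words = set(query.lower().split())
    def score(passage):
        return len(q_words & set(passage.lower().split()))
    return sorted(datastore, key=score, reverse=True)[:k]

def generate(query, passages):
    """Placeholder for LM generation: produce an answer citing each passage."""
    citations = ", ".join(f"[{i + 1}]" for i in range(len(passages)))
    return f"Answer to '{query}' {citations}"

def critique(answer, passages):
    """Placeholder critic: return feedback text, or '' to accept the draft."""
    return ""

def self_feedback_loop(query, datastore, max_rounds=2):
    """Retrieve, draft, then iteratively re-retrieve and revise using feedback."""
    passages = retrieve(query, datastore)
    answer = generate(query, passages)
    for _ in range(max_rounds):
        feedback = critique(answer, passages)
        if not feedback:  # critic is satisfied; stop refining
            break
        # Incorporate feedback into a new retrieval round and redraft.
        passages = retrieve(query + " " + feedback, datastore)
        answer = generate(query, passages)
    return answer
```

In the real system, `retrieve` would be a trained dense retriever over the 45M-paper datastore, `generate` an 8B (or GPT-4o) generator, and `critique` an LM-produced feedback step.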

