

ScholarCopilot: Training Large Language Models for Academic Writing with Accurate Citations

April 1, 2025
Authors: Yubo Wang, Xueguang Ma, Ping Nie, Huaye Zeng, Zhiheng Lyu, Yuxuan Zhang, Benjamin Schneider, Yi Lu, Xiang Yue, Wenhu Chen
cs.AI

Abstract

Academic writing requires both coherent text generation and precise citation of relevant literature. Although recent Retrieval-Augmented Generation (RAG) systems have significantly improved factual accuracy in general-purpose text generation, their ability to support professional academic writing remains limited. In this work, we introduce ScholarCopilot, a unified framework designed to enhance existing large language models for generating professional academic articles with accurate and contextually relevant citations. ScholarCopilot dynamically determines when to retrieve scholarly references by generating a retrieval token [RET], and then uses that token's representation to look up relevant citations from a database. The retrieved references are fed into the model to augment the generation process. We jointly optimize the generation and citation tasks within a single framework to increase efficiency. Trained on 500K papers from arXiv, our model achieves a top-1 retrieval accuracy of 40.1% on our evaluation dataset, outperforming baselines such as E5-Mistral-7B-Instruct (15.0%) and BM25 (9.8%). On a dataset of 1,000 academic writing samples, ScholarCopilot scores 16.2/25 in generation quality (measured across relevance, coherence, academic rigor, completeness, and innovation), surpassing models with 10x more parameters such as Qwen-2.5-72B-Instruct (15.8/25). Human studies also show ScholarCopilot's superior performance in citation recall, writing efficiency, and overall user experience, supporting the effectiveness of our approach.
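The retrieval step described in the abstract (emit [RET], use its representation as a query, look up a citation) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy three-dimensional embeddings, the `[cite:...]` marker format, and the cosine-similarity lookup are all assumptions made for illustration, and the "model" is replaced by a fixed list of (token, hidden vector) pairs.

```python
import math

RET = "[RET]"  # retrieval token named in the abstract


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, database):
    """Return the key of the reference most similar to query_vec (top-1)."""
    return max(database, key=lambda key: cosine(query_vec, database[key]))


def write_with_citations(steps, database):
    """Walk through decoding steps; on [RET], look up a citation.

    `steps` is a hypothetical stand-in for model decoding: each item is a
    (token, hidden_vector) pair, and hidden_vector is only used for [RET].
    """
    context, cited = [], []
    for token, hidden in steps:
        if token == RET:
            key = retrieve(hidden, database)
            cited.append(key)
            # In the described system the retrieved reference is fed back
            # into the model; here we only append a citation marker.
            context.append(f"[cite:{key}]")
        else:
            context.append(token)
    return " ".join(context), cited


# Hypothetical toy citation database: reference key -> embedding.
database = {
    "rag_survey": [1.0, 0.0, 0.0],
    "dense_retrieval": [0.0, 1.0, 0.0],
    "bm25": [0.0, 0.0, 1.0],
}
steps = [
    ("Dense", None), ("retrievers", None), (RET, [0.1, 0.9, 0.2]),
    ("outperform", None), ("lexical", None), ("baselines", None),
    (RET, [0.0, 0.2, 0.8]),
]
text, cited = write_with_citations(steps, database)
print(text)
print(cited)
```

In a real system the hidden vector would come from the language model's representation of the [RET] token and the database would hold embeddings of candidate papers, likely behind an approximate nearest-neighbor index; the sketch only shows the control flow of deciding when to retrieve and splicing the result into the context.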
