Project Alexandria: Towards Freeing Scientific Knowledge from Copyright Burdens via LLMs

Abstract

Paywalls, licenses and copyright rules often restrict the broad dissemination and reuse of scientific knowledge. We take the position that it is both legally and technically feasible to extract the scientific knowledge in scholarly texts. Current methods, like text embeddings, fail to reliably preserve factual content, and simple paraphrasing may not be legally sound. We urge the community to adopt a new idea: convert scholarly documents into Knowledge Units using LLMs. These units use structured data capturing entities, attributes and relationships without stylistic content. We provide evidence that Knowledge Units: (1) form a legally defensible framework for sharing knowledge from copyrighted research texts, based on legal analyses of German copyright law and U.S. Fair Use doctrine, and (2) preserve most (~95%) factual knowledge from original text, measured by MCQ performance on facts from the original copyrighted text across four research domains. Freeing scientific knowledge from copyright promises transformative benefits for scientific research and education by allowing language models to reuse important facts from copyrighted text. To support this, we share open-source tools for converting research documents into Knowledge Units. Overall, our work posits the feasibility of democratizing access to scientific knowledge while respecting copyright.
