NatureLM: Deciphering the Language of Nature for Scientific Discovery
February 11, 2025
Authors: Yingce Xia, Peiran Jin, Shufang Xie, Liang He, Chuan Cao, Renqian Luo, Guoqing Liu, Yue Wang, Zequn Liu, Yuan-Jyue Chen, Zekun Guo, Yeqi Bai, Pan Deng, Yaosen Min, Ziheng Lu, Hongxia Hao, Han Yang, Jielan Li, Chang Liu, Jia Zhang, Jianwei Zhu, Kehan Wu, Wei Zhang, Kaiyuan Gao, Qizhi Pei, Qian Wang, Xixian Liu, Yanting Li, Houtian Zhu, Yeqing Lu, Mingqian Ma, Zun Wang, Tian Xie, Krzysztof Maziarz, Marwin Segler, Zhao Yang, Zilong Chen, Yu Shi, Shuxin Zheng, Lijun Wu, Chen Hu, Peggy Dai, Tie-Yan Liu, Haiguang Liu, Tao Qin
cs.AI
Abstract
Foundation models have revolutionized natural language processing and
artificial intelligence, significantly enhancing how machines comprehend and
generate human languages. Inspired by the success of these foundation models,
researchers have developed foundation models for individual scientific domains,
including small molecules, materials, proteins, DNA, and RNA. However, these
models are typically trained in isolation, lacking the ability to integrate
across different scientific domains. Recognizing that entities within these
domains can all be represented as sequences, which together form the "language
of nature", we introduce Nature Language Model (briefly, NatureLM), a
sequence-based science foundation model designed for scientific discovery.
Pre-trained with data from multiple scientific domains, NatureLM offers a
unified, versatile model that enables various applications including: (i)
generating and optimizing small molecules, proteins, RNA, and materials using
text instructions; (ii) cross-domain generation/design, such as
protein-to-molecule and protein-to-RNA generation; and (iii) achieving
state-of-the-art performance in tasks like SMILES-to-IUPAC translation and
retrosynthesis on USPTO-50k. NatureLM offers a promising generalist approach
for various scientific tasks, including drug discovery (hit
generation/optimization, ADMET optimization, synthesis), novel material design,
and the development of therapeutic proteins or nucleotides. We have developed
NatureLM models in different sizes (1 billion, 8 billion, and 46.7 billion
parameters) and observed a clear improvement in performance as the model size
increases.
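The abstract's central observation, that small molecules, proteins, and nucleotides can all be written as character sequences and so share a common "language of nature", can be illustrated with a short sketch. The example strings and the helper functions below are illustrative assumptions, not taken from the NatureLM paper:

```python
# Illustrative only: entities from different scientific domains serialized
# as plain character sequences. The validation helpers are assumptions for
# this sketch, not part of NatureLM.

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # 20 standard one-letter codes
DNA_BASES = set("ACGT")

def is_protein_sequence(seq: str) -> bool:
    """True if the string uses only standard amino-acid letters."""
    return bool(seq) and set(seq) <= AMINO_ACIDS

def is_dna_sequence(seq: str) -> bool:
    """True if the string uses only the four DNA bases."""
    return bool(seq) and set(seq) <= DNA_BASES

# A small molecule as a SMILES string (aspirin).
molecule = "CC(=O)OC1=CC=CC=C1C(=O)O"
# A protein fragment as a one-letter amino-acid sequence.
protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
# A DNA fragment as a base sequence.
dna = "ATGGCCATTGTAATGGGCCGC"

print(is_protein_sequence(protein))  # True
print(is_dna_sequence(dna))          # True
print(is_dna_sequence(molecule))     # False: SMILES syntax is not DNA
```

Because all three domains reduce to token sequences, a single autoregressive language model can in principle be pre-trained across them, which is the premise behind tasks such as SMILES-to-IUPAC translation and protein-to-molecule generation mentioned above.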