DPLM-2: A Multimodal Diffusion Protein Language Model
October 17, 2024
Authors: Xinyou Wang, Zaixiang Zheng, Fei Ye, Dongyu Xue, Shujian Huang, Quanquan Gu
cs.AI
Abstract
Proteins are essential macromolecules defined by their amino acid sequences,
which determine their three-dimensional structures and, consequently, their
functions in all living organisms. Therefore, generative protein modeling
necessitates a multimodal approach to simultaneously model, understand, and
generate both sequences and structures. However, existing methods typically use
separate models for each modality, limiting their ability to capture the
intricate relationships between sequence and structure. This results in
suboptimal performance in tasks that require joint understanding and
generation of both modalities. In this paper, we introduce DPLM-2, a multimodal
protein foundation model that extends the discrete diffusion protein language model
(DPLM) to accommodate both sequences and structures. To enable structural
learning with the language model, 3D coordinates are converted to discrete
tokens using a lookup-free quantization-based tokenizer. By training on both
experimental and high-quality synthetic structures, DPLM-2 learns the joint
distribution of sequence and structure, as well as their marginals and
conditionals. We also implement an efficient warm-up strategy to exploit the
connection between large-scale evolutionary data and structural inductive
biases from pre-trained sequence-based protein language models. Empirical
evaluation shows that DPLM-2 can simultaneously generate highly compatible
amino acid sequences and their corresponding 3D structures, eliminating the need
for a two-stage generation approach. Moreover, DPLM-2 demonstrates competitive
performance in various conditional generation tasks, including folding, inverse
folding, and scaffolding with multimodal motif inputs, as well as providing
structure-aware representations for predictive tasks.
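The abstract's lookup-free quantization-based tokenizer can be illustrated with a minimal sketch. In lookup-free quantization, a continuous latent vector is discretized by binarizing each dimension independently (here by sign), so the token id is read off as a binary code rather than found via nearest-neighbor search in a learned codebook. The function names, latent dimensionality, and sign-based binarization below are illustrative assumptions, not the paper's actual tokenizer.

```python
# Minimal sketch of lookup-free quantization (LFQ), assuming a per-residue
# encoder latent z of shape (L, D); names and dimensions are hypothetical.
import numpy as np

def lfq_tokenize(z):
    """Map each D-dim latent to one of 2**D tokens via per-dimension sign;
    no codebook lookup is required."""
    bits = (z > 0).astype(np.int64)           # (L, D) binary code
    weights = 2 ** np.arange(z.shape[-1])     # bit positions -> integer id
    return bits @ weights                     # (L,) token ids

def lfq_dequantize(tokens, dim):
    """Map token ids back to vectors in {-1, +1}**dim."""
    bits = (tokens[:, None] >> np.arange(dim)) & 1
    return bits * 2.0 - 1.0

# 5 residues, 8-bit code -> a 256-token structure vocabulary
z = np.random.randn(5, 8)
tokens = lfq_tokenize(z)
recon = lfq_dequantize(tokens, 8)
```

Because the codebook is implicit in the bit pattern, vocabulary size is exactly 2**D and every code is used by construction, which is one commonly cited motivation for lookup-free schemes over learned-codebook VQ.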