DPLM-2: A Multimodal Diffusion Protein Language Model

Abstract

Proteins are essential macromolecules defined by their amino acid sequences, which determine their three-dimensional structures and, consequently, their functions in all living organisms. Therefore, generative protein modeling necessitates a multimodal approach to simultaneously model, understand, and generate both sequences and structures. However, existing methods typically use separate models for each modality, limiting their ability to capture the intricate relationships between sequence and structure. This results in suboptimal performance in tasks that require joint understanding and generation of both modalities. In this paper, we introduce DPLM-2, a multimodal protein foundation model that extends the discrete diffusion protein language model (DPLM) to accommodate both sequences and structures. To enable structural learning with the language model, 3D coordinates are converted to discrete tokens using a lookup-free quantization-based tokenizer. By training on both experimental and high-quality synthetic structures, DPLM-2 learns the joint distribution of sequence and structure, as well as their marginals and conditionals. We also implement an efficient warm-up strategy to exploit the connection between large-scale evolutionary data and structural inductive biases from pre-trained sequence-based protein language models. Empirical evaluation shows that DPLM-2 can simultaneously generate highly compatible amino acid sequences and their corresponding 3D structures, eliminating the need for a two-stage generation approach. Moreover, DPLM-2 demonstrates competitive performance in various conditional generation tasks, including folding, inverse folding, and scaffolding with multimodal motif inputs, and provides structure-aware representations for predictive tasks.
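The abstract does not spell out how the tokenizer works, so the following is only a minimal sketch of the general idea behind lookup-free quantization, not DPLM-2's implementation. It assumes a structure encoder (omitted here) has already mapped each residue's 3D coordinates to a small latent vector; each latent channel is then binarized by its sign, and the resulting bit pattern is read as the token index, so no codebook lookup is needed. The function names, the 3-channel latent size, and the example values are all hypothetical.

```python
from typing import List, Sequence


def lfq_quantize(latent: Sequence[float]) -> List[int]:
    """Binarize each latent channel by its sign: positive -> +1, otherwise -1."""
    return [1 if x > 0 else -1 for x in latent]


def lfq_token_id(latent: Sequence[float]) -> int:
    """Read the sign pattern as a binary number (channel i contributes 2**i)."""
    return sum(1 << i for i, x in enumerate(latent) if x > 0)


def tokenize_structure(latents: Sequence[Sequence[float]]) -> List[int]:
    """Map one latent vector per residue to one discrete token per residue."""
    return [lfq_token_id(z) for z in latents]


# Hypothetical per-residue latents with 3 channels, i.e. an implicit
# vocabulary of 2**3 = 8 structure tokens (assumed values for illustration).
example_latents = [
    [0.7, -1.2, 0.1],
    [-0.3, -0.5, -0.9],
    [2.0, 0.4, 1.5],
]
structure_tokens = tokenize_structure(example_latents)
print(structure_tokens)  # prints [5, 0, 7]
```

With structure expressed as a token sequence like this, it can sit alongside the amino acid tokens as input to a single language model, which is the setting the abstract describes.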
