Energy Efficient Protein Language Models: Leveraging Small Language Models with LoRA for Controllable Protein Generation
November 8, 2024
Authors: Aayush Shah, Shankar Jayaratnam
cs.AI
Abstract
Large language models (LLMs) have demonstrated significant success in natural
language processing (NLP) tasks and have shown promising results in other
domains such as protein sequence generation. However, there remain salient
differences between LLMs used for NLP, which effectively handle multiple tasks
and are available in small sizes, and protein language models that are often
specialized for specific tasks and only exist in larger sizes. In this work, we
introduce two small protein language models, based on Llama-3-8B and
Phi-3-mini, that are capable of both uncontrollable and controllable protein
generation. For the uncontrollable generation task, our best model achieves an
average pLDDT score of 69.75, demonstrating robust performance in generating
viable protein structures. For the controllable generation task, in which the
model generates proteins according to properties specified in the prompt, we
achieve a remarkable average TM-Score of 0.84, indicating high structural
similarity to target proteins. We chose 10 properties, including six classes of
enzymes, to extend the capabilities of prior protein language models. Our
approach utilizes the Low-Rank Adaptation (LoRA) technique, reducing trainable
parameters to just 4% of the original model size, lowering computational
requirements. By using a subset of the UniRef50 dataset and small models, we
reduced the overall training time by 70% without compromising performance.
Notably, Phi-3-mini reduced trainable parameters by 60%, decreasing training
cost by 30% compared to Llama 3, while still achieving a comparable TM-Score of
0.81, demonstrating that smaller models can match the performance of larger
ones such as Llama 3. We also demonstrate the deployment of our models
on the energy-efficient ET-SoC-1 chip, improving tokens per second per watt
(TPS/W) by a factor of 3.
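
The abstract does not spell out the fine-tuning configuration. Below is a minimal sketch of how a LoRA adapter could be attached to one of the base models with the Hugging Face PEFT library; the rank, target modules, and checkpoint name are illustrative assumptions, not the paper's reported settings.

```python
# Minimal LoRA fine-tuning sketch (illustrative only; hyperparameters are
# assumptions, not the configuration reported in the paper).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "meta-llama/Meta-Llama-3-8B"  # assumed checkpoint; Phi-3-mini would be used the same way
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# Low-rank adapters are injected into the attention projections; only these
# small matrices are trained, which is how the trainable-parameter count
# drops to a few percent of the base model.
lora_config = LoraConfig(
    r=16,                                 # assumed rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed target layers
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports trainable vs. total parameters
```

With a setup along these lines, only the adapter weights are updated during training on the protein-sequence data, while the base model remains frozen, which is consistent with the reduced training cost described in the abstract.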