METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring
January 3, 2025
Authors: Ollie Liu, Sami Jaghouar, Johannes Hagemann, Shangshang Wang, Jason Wiemels, Jeff Kaufman, Willie Neiswanger
cs.AI
Abstract
We pretrain METAGENE-1, a 7-billion-parameter autoregressive transformer
model, which we refer to as a metagenomic foundation model, on a novel corpus
of diverse metagenomic DNA and RNA sequences comprising over 1.5 trillion base
pairs. This dataset is sourced from a large collection of human wastewater
samples, processed and sequenced using deep metagenomic (next-generation)
sequencing methods. Unlike genomic models that focus on individual genomes or
curated sets of specific species, the aim of METAGENE-1 is to capture the full
distribution of genomic information present within this wastewater, to aid in
tasks relevant to pandemic monitoring and pathogen detection. We carry out
byte-pair encoding (BPE) tokenization on our dataset, tailored for metagenomic
sequences, and then pretrain our model. In this paper, we first detail the
pretraining dataset, tokenization strategy, and model architecture,
highlighting the considerations and design choices that enable the effective
modeling of metagenomic data. We then show results of pretraining this model on
our metagenomic dataset, providing details about our losses, system metrics,
and training stability over the course of pretraining. Finally, we demonstrate
the performance of METAGENE-1, which achieves state-of-the-art results on a set
of genomic benchmarks and new evaluations focused on human-pathogen detection
and genomic sequence embedding, showcasing its potential for public health
applications in pandemic monitoring, biosurveillance, and early detection of
emerging health threats.
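
The abstract mentions byte-pair encoding (BPE) tokenization tailored for metagenomic sequences. As a rough illustration of the general idea only, the following minimal sketch learns BPE merges over nucleotide reads, starting from single-base tokens. It is a toy under stated assumptions, not the tokenizer used for METAGENE-1: the function names (`train_bpe`, `apply_merge`, `tokenize`), the tie-breaking rule, and the example reads are all hypothetical.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(reads, num_merges):
    """Learn up to `num_merges` merges from an iterable of nucleotide strings."""
    # Assumption: every read starts as a sequence of single-base tokens.
    corpus = [list(read) for read in reads]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Pick the most frequent adjacent pair; ties are broken by the pair
        # itself so the result is deterministic (a choice made for this sketch).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges


def tokenize(read, merges):
    """Tokenize a read by replaying the learned merges in order."""
    tokens = list(read)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


if __name__ == "__main__":
    toy_reads = ["ACGTACGT", "ACGTTT"]  # hypothetical example reads
    learned = train_bpe(toy_reads, num_merges=3)
    print(learned)
    print(tokenize("TTACGT", learned))
```

Because merges are learned from pair frequencies in the corpus, frequently recurring subsequences end up as single tokens, which shortens the token sequence the model has to process. A production tokenizer would also need to handle ambiguous bases, special tokens, and a much larger merge budget, none of which this sketch covers.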