METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring

January 3, 2025
Authors: Ollie Liu, Sami Jaghouar, Johannes Hagemann, Shangshang Wang, Jason Wiemels, Jeff Kaufman, Willie Neiswanger
cs.AI

Abstract

We pretrain METAGENE-1, a 7-billion-parameter autoregressive transformer model, which we refer to as a metagenomic foundation model, on a novel corpus of diverse metagenomic DNA and RNA sequences comprising over 1.5 trillion base pairs. This dataset is sourced from a large collection of human wastewater samples, processed and sequenced using deep metagenomic (next-generation) sequencing methods. Unlike genomic models that focus on individual genomes or curated sets of specific species, the aim of METAGENE-1 is to capture the full distribution of genomic information present within this wastewater, to aid in tasks relevant to pandemic monitoring and pathogen detection. We carry out byte-pair encoding (BPE) tokenization on our dataset, tailored for metagenomic sequences, and then pretrain our model. In this paper, we first detail the pretraining dataset, tokenization strategy, and model architecture, highlighting the considerations and design choices that enable the effective modeling of metagenomic data. We then show results of pretraining this model on our metagenomic dataset, providing details about our losses, system metrics, and training stability over the course of pretraining. Finally, we demonstrate the performance of METAGENE-1, which achieves state-of-the-art results on a set of genomic benchmarks and new evaluations focused on human-pathogen detection and genomic sequence embedding, showcasing its potential for public health applications in pandemic monitoring, biosurveillance, and early detection of emerging health threats.
