Multimodal Language Modeling for High-Accuracy Single Cell Transcriptomics Analysis and Generation
March 12, 2025
Authors: Yaorui Shi, Jiaqi Yang, Sihang Li, Junfeng Fang, Xiang Wang, Zhiyuan Liu, Yang Zhang
cs.AI
Abstract
Pre-trained language models (PLMs) have revolutionized scientific research,
yet their application to single-cell analysis remains limited. Text PLMs cannot
process single-cell RNA sequencing data, while cell PLMs lack the ability to
handle free text, restricting their use in multimodal tasks. Existing efforts
to bridge these modalities often suffer from information loss or inadequate
single-modal pre-training, leading to suboptimal performance. To address these
challenges, we propose Single-Cell MultiModal Generative Pre-trained
Transformer (scMMGPT), a unified PLM for joint cell and text modeling. scMMGPT
effectively integrates the state-of-the-art cell and text PLMs, facilitating
cross-modal knowledge sharing for improved performance. To bridge the text-cell
modality gap, scMMGPT leverages dedicated cross-modal projectors, and undergoes
extensive pre-training on 27 million cells -- the largest dataset for
multimodal cell-text PLMs to date. This large-scale pre-training enables
scMMGPT to excel in joint cell-text tasks, achieving an 84% relative
improvement in textual discrepancy for cell description generation, 20.5%
higher accuracy for cell type annotation, and a 4% improvement in k-NN
accuracy for text-conditioned pseudo-cell generation, outperforming baselines.
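The abstract does not detail the projector design. As a rough sketch of what a cross-modal projector typically looks like in this kind of architecture, the PyTorch snippet below maps cell-encoder embeddings into the text PLM's embedding space so they can be consumed as soft tokens. The class name, dimensions, and two-layer MLP structure are illustrative assumptions, not details taken from scMMGPT.

```python
# Minimal sketch of a cell-to-text projector, assuming the common pattern of
# mapping cell-encoder outputs into the text LM's token-embedding space.
# All names and dimensions are illustrative, not from the paper.
import torch
import torch.nn as nn

class CellToTextProjector(nn.Module):
    """Projects cell-PLM embeddings into the text-PLM embedding space."""

    def __init__(self, cell_dim: int = 512, text_dim: int = 4096):
        super().__init__()
        # A small MLP is a typical choice for bridging a modality gap.
        self.proj = nn.Sequential(
            nn.Linear(cell_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, cell_emb: torch.Tensor) -> torch.Tensor:
        # cell_emb: (batch, num_cell_tokens, cell_dim) from a cell PLM.
        # Returns (batch, num_cell_tokens, text_dim) "soft tokens" that can
        # be prepended to the text PLM's input embeddings.
        return self.proj(cell_emb)

# Example: project a batch of 8 cells, each encoded as 32 tokens of width 512.
projector = CellToTextProjector()
soft_tokens = projector(torch.randn(8, 32, 512))
print(soft_tokens.shape)  # torch.Size([8, 32, 4096])
```

A symmetric text-to-cell projector would map text-PLM hidden states back into the cell encoder's space for text-conditioned pseudo-cell generation; the paper's actual projectors may differ from this sketch.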