Clinical ModernBERT: An efficient and long context encoder for biomedical text

April 4, 2025
Authors: Simon A. Lee, Anthony Wu, Jeffrey N. Chiang
cs.AI

Abstract

We introduce Clinical ModernBERT, a transformer-based encoder pretrained on large-scale biomedical literature, clinical notes, and medical ontologies, incorporating PubMed abstracts, MIMIC-IV clinical data, and medical codes with their textual descriptions. Building on ModernBERT, the current state-of-the-art natural language text encoder featuring architectural upgrades such as rotary positional embeddings (RoPE), Flash Attention, and an extended context length of up to 8,192 tokens, our model adapts these innovations specifically for biomedical and clinical domains. Clinical ModernBERT excels at producing semantically rich representations tailored for long-context tasks. We validate this both by analyzing its pretrained weights and through empirical evaluation on a comprehensive suite of clinical NLP benchmarks.
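
As a rough usage sketch (an assumption, not code from the paper), the snippet below shows how a ModernBERT-style encoder with an 8,192-token context window could embed a long clinical note via Hugging Face transformers; the checkpoint name is a placeholder to be replaced with the released Clinical ModernBERT weights.

```python
# Minimal sketch (assumption, not from the paper): embedding a long clinical note
# with a ModernBERT-style encoder through Hugging Face transformers.
# MODEL_NAME is a placeholder; point it at the released Clinical ModernBERT weights.
# Assumes a transformers version with ModernBERT support (>= 4.48).
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "answerdotai/ModernBERT-base"  # placeholder base checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

# RoPE + Flash Attention let the encoder handle sequences up to 8,192 tokens,
# so long discharge summaries need far less truncation than with a 512-token BERT.
note = "Patient admitted with acute decompensated heart failure ..."
inputs = tokenizer(note, return_tensors="pt", truncation=True, max_length=8192)

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool token embeddings into a single note-level representation.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)  # (1, hidden_size)
```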
