NeoBERT: A Next-Generation BERT
February 26, 2025
Authors: Lola Le Breton, Quentin Fournier, Mariam El Mezouar, Sarath Chandar
cs.AI
Abstract
Recent innovations in architecture, pre-training, and fine-tuning have led to
the remarkable in-context learning and reasoning abilities of large
auto-regressive language models such as LLaMA and DeepSeek. In contrast,
encoders like BERT and RoBERTa have not seen the same level of progress despite
being foundational for many downstream NLP applications. To bridge this gap, we
introduce NeoBERT, a next-generation encoder that redefines the capabilities of
bidirectional models by integrating state-of-the-art advancements in
architecture, modern data, and optimized pre-training methodologies. NeoBERT is
designed for seamless adoption: it serves as a plug-and-play replacement for
existing base models, relies on an optimal depth-to-width ratio, and leverages
an extended context length of 4,096 tokens. Despite its compact 250M parameter
footprint, it achieves state-of-the-art results on the massive MTEB benchmark,
outperforming BERT large, RoBERTa large, NomicBERT, and ModernBERT under
identical fine-tuning conditions. In addition, we rigorously evaluate the
impact of each modification on GLUE and design a uniform fine-tuning and
evaluation framework for MTEB. We release all code, data, checkpoints, and
training scripts to accelerate research and real-world adoption.
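To illustrate the plug-and-play claim, the sketch below loads the encoder through the Hugging Face transformers API in the same way one would load BERT or RoBERTa. It is a minimal sketch, assuming the released checkpoint is hosted on the Hugging Face Hub; the identifier "chandar-lab/NeoBERT" and the use of trust_remote_code=True are assumptions, not details stated in the abstract.

# Minimal usage sketch under the assumption that NeoBERT is available as a
# Hugging Face checkpoint; the identifier and the trust_remote_code flag are
# assumptions, not details given in the abstract.
from transformers import AutoModel, AutoTokenizer

model_id = "chandar-lab/NeoBERT"  # assumed Hub identifier
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

# Encode a sentence; the extended context allows inputs of up to 4,096 tokens.
inputs = tokenizer(
    "NeoBERT is a next-generation bidirectional encoder.",
    return_tensors="pt",
    truncation=True,
    max_length=4096,
)
outputs = model(**inputs)
token_embeddings = outputs.last_hidden_state  # shape: (batch, seq_len, hidden)

Under these assumptions, migrating from an existing base encoder amounts to changing the model identifier, while the 4,096-token context removes the 512-token limit of the original BERT.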