NeoBERT: A Next-Generation BERT

February 26, 2025
Authors: Lola Le Breton, Quentin Fournier, Mariam El Mezouar, Sarath Chandar
cs.AI

Abstract

Recent innovations in architecture, pre-training, and fine-tuning have led to the remarkable in-context learning and reasoning abilities of large auto-regressive language models such as LLaMA and DeepSeek. In contrast, encoders like BERT and RoBERTa have not seen the same level of progress despite being foundational for many downstream NLP applications. To bridge this gap, we introduce NeoBERT, a next-generation encoder that redefines the capabilities of bidirectional models by integrating state-of-the-art advancements in architecture, modern data, and optimized pre-training methodologies. NeoBERT is designed for seamless adoption: it serves as a plug-and-play replacement for existing base models, relies on an optimal depth-to-width ratio, and leverages an extended context length of 4,096 tokens. Despite its compact 250M parameter footprint, it achieves state-of-the-art results on the massive MTEB benchmark, outperforming BERT large, RoBERTa large, NomicBERT, and ModernBERT under identical fine-tuning conditions. In addition, we rigorously evaluate the impact of each modification on GLUE and design a uniform fine-tuning and evaluation framework for MTEB. We release all code, data, checkpoints, and training scripts to accelerate research and real-world adoption.
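The abstract describes NeoBERT as a plug-and-play replacement for existing base encoders. Below is a minimal sketch of what that could look like with the Hugging Face Transformers library; the repository identifier "chandar-lab/NeoBERT" and the use of trust_remote_code are assumptions about the released checkpoint, not details stated in the abstract.

import torch
from transformers import AutoModel, AutoTokenizer

# Assumed Hub identifier for the released checkpoint; adjust to the actual release.
model_id = "chandar-lab/NeoBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

# The 4,096-token context window means long inputs need far less truncation than with BERT-style encoders.
inputs = tokenizer(
    "NeoBERT is a next-generation bidirectional encoder.",
    return_tensors="pt",
    truncation=True,
    max_length=4096,
)

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token representations into a single sentence embedding.
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # (1, hidden_size)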
