
ModernBERT or DeBERTaV3? Examining Architecture and Data Influence on Transformer Encoder Models Performance

April 11, 2025
Authors: Wissam Antoun, Benoît Sagot, Djamé Seddah
cs.AI

Abstract

Pretrained transformer-encoder models like DeBERTaV3 and ModernBERT introduce architectural advancements aimed at improving efficiency and performance. Although the authors of ModernBERT report improved performance over DeBERTaV3 on several benchmarks, the lack of disclosed training data and the absence of comparisons on a shared dataset make it difficult to determine whether these gains come from architectural improvements or from differences in training data. In this work, we conduct a controlled study by pretraining ModernBERT on the same dataset as CamemBERTaV2, a French DeBERTaV3-based model, isolating the effect of model design. Our results show that the previous model generation remains superior in sample efficiency and overall benchmark performance, with ModernBERT's primary advantage being faster training and inference. However, the newly proposed model still delivers meaningful architectural improvements over earlier models such as BERT and RoBERTa. Additionally, we observe that high-quality pretraining data accelerates convergence but does not significantly improve final performance, suggesting potential benchmark saturation. These findings highlight the importance of disentangling pretraining data from architectural innovations when evaluating transformer models.
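
The speed advantage claimed for ModernBERT can be probed directly at inference time. Below is a minimal sketch, not the authors' evaluation code, that times forward passes of the two encoder architectures with the Hugging Face transformers library; the checkpoint identifiers are assumptions chosen for illustration and should be replaced with the models you actually want to compare (loading ModernBERT requires a recent transformers release).

```python
import time

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Checkpoints to compare; these identifiers are illustrative assumptions,
# not the exact checkpoints pretrained for the controlled study.
CHECKPOINTS = [
    "answerdotai/ModernBERT-base",   # ModernBERT architecture (English release)
    "almanach/camembertav2-base",    # DeBERTaV3-based French model (assumed id)
]

sentence = "Les modèles encodeurs pré-entraînés transforment le TAL français."
texts = [sentence] * 32  # small fixed batch for a rough throughput comparison

for name in CHECKPOINTS:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForMaskedLM.from_pretrained(name).eval()

    batch = tokenizer(texts, return_tensors="pt", padding=True)

    with torch.no_grad():
        model(**batch)  # warm-up forward pass

        start = time.perf_counter()
        for _ in range(10):
            model(**batch)
        elapsed = time.perf_counter() - start

    print(f"{name}: {elapsed / 10 * 1000:.1f} ms per batch of {len(texts)}")
```

Wall-clock timing of a fixed batch is only a coarse proxy; a fuller comparison would also control for sequence length, hardware, and batching strategy, as the paper's controlled setup does for training data.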

