ModernBERT or DeBERTaV3? Examining Architecture and Data Influence on Transformer Encoder Models Performance
April 11, 2025
Authors: Wissam Antoun, Benoît Sagot, Djamé Seddah
cs.AI
Abstract
Pretrained transformer-encoder models like DeBERTaV3 and ModernBERT introduce
architectural advancements aimed at improving efficiency and performance.
Although the authors of ModernBERT report improved performance over DeBERTaV3
on several benchmarks, the lack of disclosed training data and the absence of
comparisons using a shared dataset make it difficult to determine whether these
gains are due to architectural improvements or differences in training data. In
this work, we conduct a controlled study by pretraining ModernBERT on the same
dataset as CamemBERTaV2, a DeBERTaV3 French model, isolating the effect of
model design. Our results show that the previous model generation remains
superior in sample efficiency and overall benchmark performance, with
ModernBERT's primary advantage being faster training and inference speed.
However, the newly proposed model still provides meaningful architectural
improvements compared to earlier models such as BERT and RoBERTa. Additionally,
we observe that high-quality pre-training data accelerates convergence but does
not significantly improve final performance, suggesting potential benchmark
saturation. These findings show the importance of disentangling pretraining
data from architectural innovations when evaluating transformer models.
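Although the paper's controlled comparison is carried out at the pretraining stage, the same principle, holding the data and hyperparameters fixed while varying only the architecture, can be illustrated with a small fine-tuning harness. The sketch below is a minimal, hypothetical example using the Hugging Face transformers and datasets libraries; the checkpoint IDs, dataset name, column names, and hyperparameters are placeholders for illustration, not the configuration used in the study.

```python
# Hypothetical sketch of a controlled architecture comparison: fine-tune two
# encoder checkpoints on the SAME labeled dataset with the SAME hyperparameters,
# so that any score difference reflects model design rather than data.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Placeholder identifiers: substitute the checkpoints and benchmark you want to compare.
CHECKPOINTS = {
    "debertav3_style": "almanach/camembertav2-base",    # assumed Hub ID, verify before use
    "modernbert_style": "your-org/modernbert-fr-base",  # hypothetical ModernBERT trained on the same corpus
}
DATASET_ID = "your-org/french-classification-benchmark"  # any labeled French task with text/label columns

raw = load_dataset(DATASET_ID)
# Assumes the label column is a ClassLabel feature and a validation split exists.
num_labels = raw["train"].features["label"].num_classes

results = {}
for name, ckpt in CHECKPOINTS.items():
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=num_labels)

    def tokenize(batch):
        # Assumes the input text lives in a column named "text".
        return tokenizer(batch["text"], truncation=True, max_length=512)

    encoded = raw.map(tokenize, batched=True)

    args = TrainingArguments(
        output_dir=f"runs/{name}",
        num_train_epochs=3,
        per_device_train_batch_size=32,
        learning_rate=2e-5,
        seed=42,  # identical seed and hyperparameters for both models
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=encoded["train"],
        eval_dataset=encoded["validation"],
        tokenizer=tokenizer,
    )
    trainer.train()
    results[name] = trainer.evaluate()

print(results)
```

A pretraining-level replication, as in the paper, would instead train both architectures from scratch with a masked-language-modeling objective on the same corpus before any downstream evaluation; the harness above only mirrors the "shared data, shared settings" idea at the cheaper fine-tuning level.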