CamemBERT 2.0: A Smarter French Language Model Aged to Perfection
November 13, 2024
Authors: Wissam Antoun, Francis Kulumba, Rian Touchent, Éric de la Clergerie, Benoît Sagot, Djamé Seddah
cs.AI
Abstract
French language models, such as CamemBERT, have been widely adopted across
industries for natural language processing (NLP) tasks, with models like
CamemBERT seeing over 4 million downloads per month. However, these models face
challenges due to temporal concept drift, where outdated training data leads to
a decline in performance, especially when encountering new topics and
terminology. This issue emphasizes the need for updated models that reflect
current linguistic trends. In this paper, we introduce two new versions of the
CamemBERT base model, CamemBERTav2 and CamemBERTv2, designed to address these
challenges. CamemBERTav2 is based on the DeBERTaV3 architecture and makes use
of the Replaced Token Detection (RTD) objective for better contextual
understanding, while CamemBERTv2 is built on RoBERTa, which uses the Masked
Language Modeling (MLM) objective. Both models are trained on a significantly
larger and more recent dataset with longer context length and an updated
tokenizer that enhances tokenization performance for French. We evaluate the
performance of these models on both general-domain NLP tasks and
domain-specific applications, such as medical field tasks, demonstrating their
versatility and effectiveness across a range of use cases. Our results show
that these updated models vastly outperform their predecessors, making them
valuable tools for modern NLP systems. All our new models, as well as
intermediate checkpoints, are made openly available on Hugging Face.
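To make the difference between the two training objectives named in the abstract concrete, here is a minimal sketch of how training targets could be built for MLM and for RTD. It is a toy illustration, not the authors' training code: the function names, the `<mask>` string, and the probabilities are assumptions, and random replacement stands in for the generator network that an RTD setup would normally use to propose replacements.

```python
import random

MASK = "<mask>"  # placeholder mask symbol, chosen for this illustration


def mlm_example(tokens, mask_prob, rng):
    """Masked Language Modeling: hide some tokens and ask the model to recover them.

    Returns (inputs, labels). A label is the original token at masked
    positions and None elsewhere, so only masked positions are scored.
    """
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


def rtd_example(tokens, replace_prob, vocab, rng):
    """Replaced Token Detection: swap some tokens and ask the model to spot the swaps.

    Returns (inputs, labels). Every position gets a binary label:
    1 if the token was replaced, 0 if it is the original.
    Simplification: replacements are sampled uniformly from `vocab`
    rather than proposed by a trained generator.
    """
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            candidates = [v for v in vocab if v != tok]
            inputs.append(rng.choice(candidates))
            labels.append(1)
        else:
            inputs.append(tok)
            labels.append(0)
    return inputs, labels


if __name__ == "__main__":
    rng = random.Random(0)
    sentence = "le fromage est bon".split()
    vocab = ["le", "fromage", "est", "bon", "vin", "pain"]
    print(mlm_example(sentence, 0.3, rng))
    print(rtd_example(sentence, 0.3, vocab, rng))
```

In this sketch, MLM produces a training signal only at the masked positions, whereas RTD labels every position in the sequence.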