LLäMmlein: Compact and Competitive German-Only Language Models from Scratch
November 17, 2024
Authors: Jan Pfister, Julia Wunderle, Andreas Hotho
cs.AI
Abstract
We create two German-only decoder models, LLäMmlein 120M and 1B, transparently from scratch and publish them, along with the training data, for the German NLP research community to use. The model training involved several key steps, including extensive data preprocessing, the creation of a custom German tokenizer, the training itself, and the evaluation of the final models on various benchmarks. Throughout the training process, multiple checkpoints were saved and analyzed using the SuperGLEBer benchmark to monitor the models' learning dynamics. Compared to state-of-the-art models on the SuperGLEBer benchmark, both LLäMmlein models performed competitively, consistently matching or surpassing models with similar parameter sizes. The results show that the models' quality scales with size as expected, but performance improvements on some tasks plateaued early, offering valuable insights into resource allocation for future model development.
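The custom German tokenizer step could, for instance, be realized with the Hugging Face `tokenizers` library. The following is a minimal sketch under stated assumptions, not the authors' actual recipe: the corpus file, vocabulary size, and special tokens are all placeholders.

```python
# A minimal sketch of training a German byte-level BPE tokenizer with the
# Hugging Face `tokenizers` library. The corpus file, vocabulary size, and
# special tokens below are placeholder assumptions, not values from the paper.
import os

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["german_corpus.txt"],  # hypothetical preprocessed German corpus
    vocab_size=32_000,            # assumed vocabulary size
    min_frequency=2,
    special_tokens=["<s>", "</s>", "<unk>", "<pad>"],
)

os.makedirs("llammlein_tokenizer", exist_ok=True)
tokenizer.save_model("llammlein_tokenizer")  # writes vocab.json and merges.txt
```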
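Since both models are published, they should be loadable with the standard `transformers` API. A minimal usage sketch follows; the repository ID is an assumption and may differ from the actual Hub release.

```python
# Minimal sketch of loading a published LLäMmlein model for text generation.
# The repository ID below is assumed; verify it against the official release.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "LSX-UniWue/LLaMmlein_1B"  # assumed Hugging Face Hub repository ID
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("Die Würzburger Residenz ist", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

If the saved training checkpoints mentioned in the abstract are published as Hub revisions (a common pattern, though not confirmed here), intermediate states could be loaded the same way by passing a `revision=` argument to `from_pretrained`.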