YuLan-Mini: An Open Data-efficient Language Model

December 23, 2024
Authors: Yiwen Hu, Huatong Song, Jia Deng, Jiapeng Wang, Jie Chen, Kun Zhou, Yutao Zhu, Jinhao Jiang, Zican Dong, Wayne Xin Zhao, Ji-Rong Wen
cs.AI

Abstract

Effective pre-training of large language models (LLMs) has been challenging due to the immense resource demands and the complexity of the technical processes involved. This paper presents a detailed technical report on YuLan-Mini, a highly capable base model with 2.42B parameters that achieves top-tier performance among models of similar parameter scale. Our pre-training approach focuses on enhancing training efficacy through three key technical contributions: an elaborate data pipeline that combines data cleaning with data scheduling strategies, a robust optimization method to mitigate training instability, and an effective annealing approach that incorporates targeted data selection and long-context training. Remarkably, YuLan-Mini, trained on 1.08T tokens, achieves performance comparable to industry-leading models that require significantly more data. To facilitate reproduction, we release the full details of the data composition for each training phase. Project details can be accessed at the following link: https://github.com/RUC-GSAI/YuLan-Mini.
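
Note: the abstract gives no implementation details. As a rough, hypothetical illustration of the kind of annealing stage it mentions (a final learning-rate decay window in which targeted data is scheduled), the Python sketch below implements a generic warmup-stable-decay learning-rate schedule. The function name wsd_lr, the phase fractions, and the learning-rate values are illustrative assumptions and are not taken from YuLan-Mini.

    # Hypothetical sketch: a warmup-stable-decay (WSD) style learning-rate
    # schedule with a final annealing phase. All constants are illustrative
    # placeholders, not values reported for YuLan-Mini.

    def wsd_lr(step: int,
               total_steps: int,
               peak_lr: float = 3e-4,
               min_lr: float = 3e-5,
               warmup_frac: float = 0.01,
               anneal_frac: float = 0.1) -> float:
        """Return the learning rate at `step` for a warmup-stable-decay schedule."""
        warmup_steps = int(total_steps * warmup_frac)
        anneal_steps = int(total_steps * anneal_frac)
        stable_end = total_steps - anneal_steps

        if step < warmup_steps:
            # Linear warmup from 0 to the peak learning rate.
            return peak_lr * step / max(warmup_steps, 1)
        if step < stable_end:
            # Stable phase: hold the peak learning rate for most of training.
            return peak_lr
        # Annealing phase: linear decay toward the minimum learning rate;
        # curated or long-context data is often concentrated in this window.
        progress = (step - stable_end) / max(anneal_steps, 1)
        return peak_lr + (min_lr - peak_lr) * progress


    if __name__ == "__main__":
        total = 100_000
        for s in (0, 500, 50_000, 95_000, 100_000):
            print(f"step {s:>7}: lr = {wsd_lr(s, total):.2e}")

In schedules of this family, the stable phase keeps the peak learning rate for the bulk of training, while higher-quality or long-context data is commonly scheduled into the decay (annealing) window; whether and how YuLan-Mini does this is detailed in the full report, not in this sketch.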
