YuLan-Mini: An Open Data-efficient Language Model
December 23, 2024
Authors: Yiwen Hu, Huatong Song, Jia Deng, Jiapeng Wang, Jie Chen, Kun Zhou, Yutao Zhu, Jinhao Jiang, Zican Dong, Wayne Xin Zhao, Ji-Rong Wen
cs.AI
Abstract
Effective pre-training of large language models (LLMs) has been challenging
due to the immense resource demands and the complexity of the technical
processes involved. This paper presents a detailed technical report on
YuLan-Mini, a highly capable base model with 2.42B parameters that achieves
top-tier performance among models of similar parameter scale. Our pre-training
approach focuses on enhancing training efficacy through three key technical
contributions: an elaborate data pipeline that combines data cleaning with
data scheduling strategies, a robust optimization method to mitigate training
instability, and an effective annealing approach that incorporates targeted
data selection and long context training. Remarkably, YuLan-Mini, trained on
1.08T tokens, achieves performance comparable to industry-leading models that
require significantly more data. To facilitate reproduction, we release the
full details of the data composition for each training phase. Project details
can be accessed at the following link: https://github.com/RUC-GSAI/YuLan-Mini
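The abstract's third contribution is an annealing stage at the end of pre-training, combined with targeted data selection and long-context training. As a rough illustration only, the sketch below implements a generic warmup-stable-decay style learning-rate schedule in Python, where the final decay segment plays the role of the annealing phase. The function name wsd_lr and every constant in it are illustrative assumptions, not the schedule or hyperparameters reported for YuLan-Mini.

```python
# Minimal sketch of a warmup-stable-decay (annealing) learning-rate schedule.
# All constants are placeholders for illustration; they are NOT the values
# used to train YuLan-Mini.
import math

def wsd_lr(step: int,
           peak_lr: float = 3e-4,
           final_lr: float = 3e-5,
           warmup_steps: int = 2_000,
           stable_steps: int = 200_000,
           anneal_steps: int = 20_000) -> float:
    """Return the learning rate for a given optimizer step."""
    if step < warmup_steps:
        # Linear warmup from 0 to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    if step < warmup_steps + stable_steps:
        # Long stable phase at the peak learning rate.
        return peak_lr
    # Annealing phase: cosine decay from peak_lr down to final_lr.
    t = min(1.0, (step - warmup_steps - stable_steps) / max(1, anneal_steps))
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * t))

if __name__ == "__main__":
    # Inspect the schedule at a few representative steps.
    for s in (0, 1_000, 2_000, 100_000, 210_000, 222_000):
        print(s, f"{wsd_lr(s):.2e}")
```

In schedules of this family, the start of the decay segment is also a natural point to switch the data mixture and extend the context window, which is where the targeted data selection and long-context training mentioned in the abstract would fit.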