YuLan-Mini: An Open Data-efficient Language Model
December 23, 2024
Authors: Yiwen Hu, Huatong Song, Jia Deng, Jiapeng Wang, Jie Chen, Kun Zhou, Yutao Zhu, Jinhao Jiang, Zican Dong, Wayne Xin Zhao, Ji-Rong Wen
cs.AI
Abstract
Effective pre-training of large language models (LLMs) has been challenging
due to the immense resource demands and the complexity of the technical
processes involved. This paper presents a detailed technical report on
YuLan-Mini, a highly capable base model with 2.42B parameters that achieves
top-tier performance among models of similar parameter scale. Our pre-training
approach focuses on enhancing training efficacy through three key technical
contributions: an elaborate data pipeline that combines data cleaning with data
scheduling strategies, a robust optimization method to mitigate training
instability, and an effective annealing approach that incorporates targeted
data selection and long-context training. Remarkably, YuLan-Mini, trained on
1.08T tokens, achieves performance comparable to industry-leading models that
require significantly more data. To facilitate reproduction, we release the
full details of the data composition for each training phase. Project details
can be accessed at the following link: https://github.com/RUC-GSAI/YuLan-Mini.
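In LLM pre-training, an annealing phase of the kind mentioned in the abstract typically means a final learning-rate decay stage run on a re-weighted, higher-quality data mixture. The sketch below shows a generic warmup-stable-decay schedule that is commonly used for such a phase; it is a minimal illustration only, and the function name, phase fractions, and learning-rate values are assumptions for this sketch, not the schedule actually used by YuLan-Mini (see the technical report for the exact recipe).

```python
import math

# Hedged sketch: a generic warmup-stable-decay (WSD) learning-rate schedule,
# often used to implement an "annealing" phase at the end of LLM pre-training.
# All names and numbers here are illustrative assumptions, not YuLan-Mini's
# actual hyperparameters.

def wsd_lr(step: int, total_steps: int, peak_lr: float = 1e-3,
           min_lr: float = 1e-5, warmup_frac: float = 0.01,
           decay_frac: float = 0.1) -> float:
    """Piecewise schedule: linear warmup, constant plateau, 1 - sqrt decay."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    decay_steps = max(1, int(total_steps * decay_frac))
    decay_start = total_steps - decay_steps

    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    if step < decay_start:
        # Stable plateau at peak_lr for most of training.
        return peak_lr
    # Annealing: 1 - sqrt decay from peak_lr down to min_lr.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return min_lr + (peak_lr - min_lr) * (1.0 - math.sqrt(progress))

# Example: learning rate at a few points of a hypothetical 100k-step run.
for s in (0, 500, 50_000, 95_000, 100_000):
    print(f"step {s:>7}: lr = {wsd_lr(s, 100_000):.6f}")
```

Targeted data selection during such a phase usually amounts to up-weighting curated, higher-quality sources in the final mixture; the released data composition for each training phase documents the specific choices made for YuLan-Mini.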