YuLan-Mini: An Open Data-efficient Language Model
December 23, 2024
Authors: Yiwen Hu, Huatong Song, Jia Deng, Jiapeng Wang, Jie Chen, Kun Zhou, Yutao Zhu, Jinhao Jiang, Zican Dong, Wayne Xin Zhao, Ji-Rong Wen
cs.AI
Abstract
Effective pre-training of large language models (LLMs) has been challenging
due to the immense resource demands and the complexity of the technical
processes involved. This paper presents a detailed technical report on
YuLan-Mini, a highly capable base model with 2.42B parameters that achieves
top-tier performance among models of similar parameter scale. Our pre-training
approach focuses on enhancing training efficacy through three key technical
contributions: an elaborate data pipeline that combines data cleaning with data
scheduling strategies, a robust optimization method to mitigate training
instability, and an effective annealing approach that incorporates targeted
data selection and long-context training. Remarkably, YuLan-Mini, trained on
1.08T tokens, achieves performance comparable to industry-leading models that
require significantly more data. To facilitate reproduction, we release the
full details of the data composition for each training phase. Project details
can be accessed at the following link: https://github.com/RUC-GSAI/YuLan-Mini.
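In LLM pre-training, an annealing phase of the kind mentioned in the abstract typically means a final learning-rate decay stage run on a re-weighted, higher-quality data mixture. The sketch below shows a generic warmup-stable-decay schedule that is commonly used for such a phase; it is a minimal illustration only, and the function name, phase fractions, and learning-rate values are assumptions for this sketch, not the schedule actually used by YuLan-Mini (see the technical report for the exact recipe).

```python
import math

# Hedged sketch: a generic warmup-stable-decay (WSD) learning-rate schedule,
# often used to implement an "annealing" phase at the end of LLM pre-training.
# All names and numbers here are illustrative assumptions, not YuLan-Mini's
# actual hyperparameters.

def wsd_lr(step: int, total_steps: int, peak_lr: float = 1e-3,
           min_lr: float = 1e-5, warmup_frac: float = 0.01,
           decay_frac: float = 0.1) -> float:
    """Piecewise schedule: linear warmup, constant plateau, 1 - sqrt decay."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    decay_steps = max(1, int(total_steps * decay_frac))
    decay_start = total_steps - decay_steps

    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    if step < decay_start:
        # Stable plateau at peak_lr for most of training.
        return peak_lr
    # Annealing: 1 - sqrt decay from peak_lr down to min_lr.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return min_lr + (peak_lr - min_lr) * (1.0 - math.sqrt(progress))

# Example: learning rate at a few points of a hypothetical 100k-step run.
for s in (0, 500, 50_000, 95_000, 100_000):
    print(f"step {s:>7}: lr = {wsd_lr(s, 100_000):.6f}")
```

Targeted data selection during such a phase usually amounts to up-weighting curated, higher-quality sources in the final mixture; the released data composition for each training phase documents the specific choices made for YuLan-Mini.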