SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
February 4, 2025
作者: Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, Joshua Lochner, Caleb Fahlgren, Xuan-Son Nguyen, Clémentine Fourrier, Ben Burtenshaw, Hugo Larcher, Haojun Zhao, Cyril Zakka, Mathieu Morlon, Colin Raffel, Leandro von Werra, Thomas Wolf
cs.AI
Abstract
While large language models have facilitated breakthroughs in many
applications of artificial intelligence, their inherent largeness makes them
computationally expensive and challenging to deploy in resource-constrained
settings. In this paper, we document the development of SmolLM2, a
state-of-the-art "small" (1.7 billion parameter) language model (LM). To attain
strong performance, we overtrain SmolLM2 on ~11 trillion tokens of data using a
multi-stage training process that mixes web text with specialized math, code,
and instruction-following data. We additionally introduce new specialized
datasets (FineMath, Stack-Edu, and SmolTalk) at stages where we found existing
datasets to be problematically small or low-quality. To inform our design
decisions, we perform both small-scale ablations as well as a manual refinement
process that updates the dataset mixing rates at each stage based on the
performance at the previous stage. Ultimately, we demonstrate that SmolLM2
outperforms other recent small LMs including Qwen2.5-1.5B and Llama3.2-1B. To
facilitate future research on LM development as well as applications of small
LMs, we release both SmolLM2 as well as all of the datasets we prepared in the
course of this project.
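The abstract only sketches the multi-stage training recipe, in which dataset mixing rates are re-balanced by hand between stages based on the previous stage's evaluations. As a rough illustration of that idea, the snippet below shows one way such a stage-wise mixture schedule could be expressed; all dataset names and weights are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch of stage-wise dataset mixing, not the authors' actual
# training code. Source names and mixing weights are illustrative only.
import random

# Each stage assigns a sampling probability to every data source.
# In the paper's workflow, these weights are manually refined between
# stages based on evaluation results from the previous stage.
STAGES = [
    {"web": 0.90, "code": 0.05, "math": 0.05},                     # early: mostly web text
    {"web": 0.75, "code": 0.15, "math": 0.10},                     # mid: more code and math
    {"web": 0.55, "code": 0.20, "math": 0.15, "instruct": 0.10},   # late: add instruction data
]

def sample_source(mixture: dict[str, float]) -> str:
    """Pick a data source for the next training document according to the mixture."""
    sources, weights = zip(*mixture.items())
    return random.choices(sources, weights=weights, k=1)[0]

if __name__ == "__main__":
    for stage_idx, mixture in enumerate(STAGES):
        # A real stage would cover trillions of tokens; a few draws are
        # enough here to show how the sampled mixture shifts across stages.
        draws = [sample_source(mixture) for _ in range(10)]
        print(f"stage {stage_idx}: {draws}")
```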