Steel-LLM: From Scratch to Open Source -- A Personal Journey in Building a Chinese-Centric LLM
February 10, 2025
Authors: Qingshui Gu, Shu Li, Tianyu Zheng, Zhaoxiang Zhang
cs.AI
Abstract
Steel-LLM is a Chinese-centric language model developed from scratch with the
goal of creating a high-quality, open-source model despite limited
computational resources. Launched in March 2024, the project aimed to train a
1-billion-parameter model on a large-scale dataset, prioritizing transparency
and the sharing of practical insights to assist others in the community. The
training process primarily focused on Chinese data, with a small proportion of
English data included, addressing gaps in existing open-source LLMs by
providing a more detailed and practical account of the model-building journey.
Steel-LLM has demonstrated competitive performance on benchmarks such as CEVAL
and CMMLU, outperforming early models from larger institutions. This paper
provides a comprehensive summary of the project's key contributions, including
data collection, model design, training methodologies, and the challenges
encountered along the way, offering a valuable resource for researchers and
practitioners looking to develop their own LLMs. The model checkpoints and
training scripts are available at https://github.com/zhanshijinwat/Steel-LLM.