

Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs

April 10, 2025
Authors: Yichun Yin, Wenyong Huang, Kaikai Song, Yehui Tang, Xueyu Wu, Wei Guo, Peng Guo, Yaoyuan Wang, Xiaojun Meng, Yasheng Wang, Dong Li, Can Chen, Dandan Tu, Yin Li, Fisher Yu, Ruiming Tang, Yunhe Wang, Baojun Wang, Bin Wang, Bo Wang, Boxiao Liu, Changzheng Zhang, Duyu Tang, Fei Mi, Hui Jin, Jiansheng Wei, Jiarui Qin, Jinpeng Li, Jun Zhao, Liqun Deng, Lin Li, Minghui Xu, Naifu Zhang, Nianzu Zheng, Qiang Li, Rongju Ruan, Shengjun Cheng, Tianyu Guo, Wei He, Wei Li, Weiwen Liu, Wulong Liu, Xinyi Dai, Yonghan Dong, Yu Pan, Yue Li, Yufei Wang, Yujun Li, Yunsheng Ni, Zhe Liu, Zhenhe Zhang, Zhicheng Liu
cs.AI

Abstract

We present Pangu Ultra, a Large Language Model (LLM) with 135 billion parameters and dense Transformer modules, trained on Ascend Neural Processing Units (NPUs). Although the field has seen unprecedented advances in LLM scale and capability in recent years, training a model of this size still poses significant optimization and system challenges. To stabilize the training process, we propose depth-scaled sandwich normalization, which effectively eliminates loss spikes when training deep models. We pre-train our model on 13.2 trillion diverse, high-quality tokens and further enhance its reasoning capabilities during post-training. To perform such large-scale training efficiently, we utilize 8,192 Ascend NPUs with a series of system optimizations. Evaluations on multiple diverse benchmarks indicate that Pangu Ultra significantly advances the state-of-the-art capabilities of dense LLMs such as Llama 405B and Mistral Large 2, and even achieves results competitive with DeepSeek-R1, whose sparse model structure contains many more parameters. Our exploration demonstrates that Ascend NPUs are capable of efficiently and effectively training dense models with more than 100 billion parameters. Our model and system will be made available to our commercial customers.
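
The abstract names depth-scaled sandwich normalization as the key training stabilizer but does not give its formula. Below is a minimal PyTorch sketch of one plausible reading: each sub-layer (attention, feed-forward) is wrapped by both a pre-norm and a post-norm ("sandwich"), and the post-norm gains are initialized with a depth-dependent factor. The 1/sqrt(num_layers) rule, and the names SandwichBlock and RMSNorm, are illustrative assumptions, not the paper's exact method.

import math
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm with a configurable initial gain."""
    def __init__(self, dim: int, init_gain: float = 1.0, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.full((dim,), init_gain))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SandwichBlock(nn.Module):
    """Transformer block with sandwich normalization: every sub-layer is
    normalized both before (pre-norm) and after (post-norm) it runs.
    The post-norm gains start at 1/sqrt(num_layers) -- an assumed
    depth-scaling rule -- so each residual update begins small in deep stacks."""
    def __init__(self, dim: int, n_heads: int, num_layers: int):
        super().__init__()
        post_gain = 1.0 / math.sqrt(num_layers)  # assumption, not the paper's rule
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.pre_attn_norm = RMSNorm(dim)
        self.post_attn_norm = RMSNorm(dim, init_gain=post_gain)
        self.pre_ffn_norm = RMSNorm(dim)
        self.post_ffn_norm = RMSNorm(dim, init_gain=post_gain)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.pre_attn_norm(x)
        h, _ = self.attn(h, h, h, need_weights=False)
        x = x + self.post_attn_norm(h)   # extra norm on the residual branch
        h = self.pre_ffn_norm(x)
        return x + self.post_ffn_norm(self.ffn(h))

if __name__ == "__main__":
    block = SandwichBlock(dim=64, n_heads=4, num_layers=94)
    print(block(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])

The intuition behind such a scheme: the post-sub-layer norm bounds the magnitude of each residual update, and shrinking its initial gain with depth keeps the residual stream's variance roughly stable across many layers, which is plausibly how the technique suppresses the loss spikes the abstract describes.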
