

Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs

April 10, 2025
作者: Yichun Yin, Wenyong Huang, Kaikai Song, Yehui Tang, Xueyu Wu, Wei Guo, Peng Guo, Yaoyuan Wang, Xiaojun Meng, Yasheng Wang, Dong Li, Can Chen, Dandan Tu, Yin Li, Fisher Yu, Ruiming Tang, Yunhe Wang, Baojun Wang, Bin Wang, Bo Wang, Boxiao Liu, Changzheng Zhang, Duyu Tang, Fei Mi, Hui Jin, Jiansheng Wei, Jiarui Qin, Jinpeng Li, Jun Zhao, Liqun Deng, Lin Li, Minghui Xu, Naifu Zhang, Nianzu Zheng, Qiang Li, Rongju Ruan, Shengjun Cheng, Tianyu Guo, Wei He, Wei Li, Weiwen Liu, Wulong Liu, Xinyi Dai, Yonghan Dong, Yu Pan, Yue Li, Yufei Wang, Yujun Li, Yunsheng Ni, Zhe Liu, Zhenhe Zhang, Zhicheng Liu
cs.AI

Abstract

We present Pangu Ultra, a Large Language Model (LLM) with 135 billion parameters and dense Transformer modules, trained on Ascend Neural Processing Units (NPUs). Although the field has seen unprecedented advances in pushing the scale and capability of LLMs in recent years, training a model of this size still poses significant optimization and system challenges. To stabilize training, we propose depth-scaled sandwich normalization, which effectively eliminates loss spikes when training deep models. We pre-train our model on 13.2 trillion diverse, high-quality tokens and further enhance its reasoning capabilities during post-training. To perform such large-scale training efficiently, we utilize 8,192 Ascend NPUs together with a series of system optimizations. Evaluations on multiple diverse benchmarks indicate that Pangu Ultra significantly advances the state of the art among dense LLMs such as Llama 405B and Mistral Large 2, and even achieves results competitive with DeepSeek-R1, whose sparse model structure contains many more parameters. Our exploration demonstrates that Ascend NPUs can efficiently and effectively train dense models with more than 100 billion parameters. Our model and system will be available to our commercial customers.
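The abstract names depth-scaled sandwich normalization but does not spell out its form. The sketch below is a minimal PyTorch illustration of the general sandwich-norm idea (a normalization both before and after each sublayer, with the post-sublayer gain initialized as a function of total depth), not the paper's exact recipe: the class name SandwichBlock, the choice of nn.LayerNorm, and the 1/sqrt(2L) gain initialization are assumptions for illustration only.

```python
import math
import torch
import torch.nn as nn


class SandwichBlock(nn.Module):
    """One Transformer block with sandwich normalization: a norm is
    applied both before and after each sublayer, and the post-sublayer
    norm gains are initialized depth-dependently so that residual
    contributions stay well-scaled in very deep stacks (hypothetical
    sketch; the paper may use a different norm variant and scaling)."""

    def __init__(self, d_model: int, n_heads: int, n_layers: int):
        super().__init__()
        self.pre_attn_norm = nn.LayerNorm(d_model)
        self.post_attn_norm = nn.LayerNorm(d_model)
        self.pre_ffn_norm = nn.LayerNorm(d_model)
        self.post_ffn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # Illustrative depth scaling (assumption): damp the post-norm
        # gains as 1/sqrt(2 * n_layers), so the sum over 2L residual
        # branches stays roughly O(1) at initialization.
        gain = 1.0 / math.sqrt(2.0 * n_layers)
        for norm in (self.post_attn_norm, self.post_ffn_norm):
            nn.init.constant_(norm.weight, gain)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sandwich pattern: norm -> sublayer -> norm -> residual add.
        h = self.pre_attn_norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + self.post_attn_norm(attn_out)
        h = self.pre_ffn_norm(x)
        x = x + self.post_ffn_norm(self.ffn(h))
        return x
```

The intuition behind tying the post-norm gain to depth is that each of the 2L residual branches in an L-layer sandwich-norm network adds a normalized (hence unit-scale) contribution to the residual stream; shrinking the gains with depth keeps the stream's magnitude from growing with L at initialization, which is the usual motivation for depth-dependent schemes aimed at suppressing loss spikes.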

