Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent

Abstract

In this paper, we introduce Hunyuan-Large, currently the largest open-source Transformer-based mixture-of-experts (MoE) model, with 389 billion total parameters and 52 billion activated parameters, capable of handling up to 256K tokens. We conduct a thorough evaluation of Hunyuan-Large across various benchmarks, including language understanding and generation, logical reasoning, mathematical problem-solving, coding, long-context, and aggregated tasks, where it outperforms LLama3.1-70B and performs comparably to the significantly larger LLama3.1-405B model. Key practices of Hunyuan-Large include large-scale synthetic data that is orders of magnitude larger than in previous literature, a mixed expert routing strategy, a key-value cache compression technique, and an expert-specific learning rate strategy. We also investigate the scaling laws and learning rate schedules of mixture-of-experts models, providing insights and guidance for future model development and optimization. The code and checkpoints of Hunyuan-Large are released to facilitate future innovations and applications.

Code: https://github.com/Tencent/Hunyuan-Large
Models: https://huggingface.co/tencent/Tencent-Hunyuan-Large
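To make the gap between total and activated parameters concrete, the following is a minimal, generic sketch of sparse expert routing. It is not the Hunyuan-Large implementation: the abstract does not describe the routing in detail, so the layout used here (one always-on shared expert plus the top-k experts chosen by a softmax gate) is an assumption made for illustration, and the names `route_token`, `moe_layer`, and `activated_fraction` are hypothetical. Only the figures 389 billion and 52 billion come from the abstract.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of gate scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(gate_scores, top_k=1):
    """Return (expert_index, gate_weight) pairs for the top_k routed experts."""
    probs = softmax(gate_scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:top_k]]


def moe_layer(x, shared_expert, routed_experts, gate_scores, top_k=1):
    """Combine an always-active shared expert with the selected routed experts.

    Assumption: this shared-plus-routed layout is illustrative only and is
    not taken from the abstract.
    """
    out = shared_expert(x)
    for idx, weight in route_token(gate_scores, top_k):
        # Only the selected experts run, so most expert weights stay idle.
        out += weight * routed_experts[idx](x)
    return out


def activated_fraction(total_params_b, activated_params_b):
    """Share of parameters used per token (both arguments in billions)."""
    return activated_params_b / total_params_b


if __name__ == "__main__":
    # Toy scalar "experts" standing in for feed-forward blocks.
    experts = [lambda v: 2.0 * v, lambda v: -1.0 * v, lambda v: v + 1.0]
    y = moe_layer(3.0, lambda v: v, experts, gate_scores=[0.1, 2.0, -1.0])
    print("toy MoE output:", y)
    print("activated share:", activated_fraction(389, 52))
```

With the parameter counts stated in the abstract, 52 of 389 billion parameters, or roughly 13%, are active for any given token. This is why the abstract compares the model against dense models of both 70B and 405B parameters.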
