Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent

Abstract

In this paper, we introduce Hunyuan-Large, currently the largest open-source Transformer-based mixture-of-experts (MoE) model, with 389 billion total parameters and 52 billion activated parameters, capable of handling up to 256K tokens. We evaluate Hunyuan-Large thoroughly across benchmarks covering language understanding and generation, logical reasoning, mathematical problem-solving, coding, long-context, and aggregated tasks, where it outperforms Llama3.1-70B and performs comparably to the significantly larger Llama3.1-405B. Key practices of Hunyuan-Large include large-scale synthetic data that is orders of magnitude larger than in previous literature, a mixed expert routing strategy, a key-value cache compression technique, and an expert-specific learning rate strategy. We also investigate the scaling laws and learning rate schedule of mixture-of-experts models, providing insights and guidance for future model development and optimization. The code and checkpoints of Hunyuan-Large are released to facilitate future innovations and applications.

Code: https://github.com/Tencent/Hunyuan-Large
Model: https://huggingface.co/tencent/Tencent-Hunyuan-Large
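The gap between total and activated parameters comes from sparse routing: each token passes through only a few experts, so only part of the model does work per token. The following is a minimal sketch of that bookkeeping, not the authors' accounting. The function names and the toy split into always-used and per-expert parameters are hypothetical; the abstract states only the two totals (389 billion and 52 billion).

```python
def moe_param_counts(dense_params, params_per_expert, num_experts, experts_per_token):
    """Return (total, activated) parameter counts for a toy MoE model.

    dense_params: parameters every token uses (hypothetical split, e.g.
        attention and embeddings); not taken from the paper.
    params_per_expert: parameters in one routed expert.
    num_experts: routed experts stored in the model.
    experts_per_token: routed experts each token actually visits.
    """
    total = dense_params + num_experts * params_per_expert
    activated = dense_params + experts_per_token * params_per_expert
    return total, activated


def activated_fraction(total, activated):
    """Share of the model's parameters used for a single token."""
    return activated / total


# Toy numbers (assumed, for illustration only).
toy_total, toy_activated = moe_param_counts(
    dense_params=10, params_per_expert=5, num_experts=16, experts_per_token=2
)
print(toy_total, toy_activated)  # 90 20

# The two figures stated in the abstract.
HUNYUAN_TOTAL = 389_000_000_000
HUNYUAN_ACTIVATED = 52_000_000_000
print(f"{activated_fraction(HUNYUAN_TOTAL, HUNYUAN_ACTIVATED):.1%}")  # 13.4%
```

Under the stated figures, roughly 13% of the parameters are active for any one token.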
