MiniMax-01: Scaling Foundation Models with Lightning Attention

January 14, 2025
Authors: MiniMax, Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, Enwei Jiao, Gengxin Li, Guojun Zhang, Haohai Sun, Houze Dong, Jiadai Zhu, Jiaqi Zhuang, Jiayuan Song, Jin Zhu, Jingtao Han, Jingyang Li, Junbin Xie, Junhao Xu, Junjie Yan, Kaishun Zhang, Kecheng Xiao, Kexi Kang, Le Han, Leyang Wang, Lianfei Yu, Liheng Feng, Lin Zheng, Linbo Chai, Long Xing, Meizhi Ju, Mingyuan Chi, Mozhi Zhang, Peikai Huang, Pengcheng Niu, Pengfei Li, Pengyu Zhao, Qi Yang, Qidi Xu, Qiexiang Wang, Qin Wang, Qiuhui Li, Ruitao Leng, Shengmin Shi, Shuqi Yu, Sichen Li, Songquan Zhu, Tao Huang, Tianrun Liang, Weigao Sun, Weixuan Sun, Weiyu Cheng, Wenkai Li, Xiangjun Song, Xiao Su, Xiaodong Han, Xinjie Zhang, Xinzhu Hou, Xu Min, Xun Zou, Xuyang Shen, Yan Gong, Yingjie Zhu, Yipeng Zhou, Yiran Zhong, Yongyi Hu, Yuanxiang Fan, Yue Yu, Yufeng Yang, Yuhao Li, Yunan Huang, Yunji Li, Yunpeng Huang, Yunzhi Xu, Yuxin Mao, Zehan Li, Zekang Li, Zewei Tao, Zewen Ying, Zhaoyang Cong, Zhen Qin, Zhenhua Fan, Zhihang Yu, Zhuo Jiang, Zijia Wu
cs.AI

Abstract

We introduce the MiniMax-01 series, including MiniMax-Text-01 and MiniMax-VL-01, which are comparable to top-tier models while offering superior capabilities in processing longer contexts. The core lies in lightning attention and its efficient scaling. To maximize computational capacity, we integrate it with Mixture of Experts (MoE), creating a model with 32 experts and 456 billion total parameters, of which 45.9 billion are activated for each token. We develop an optimized parallel strategy and highly efficient computation-communication overlap techniques for MoE and lightning attention. This approach enables us to conduct efficient training and inference on models with hundreds of billions of parameters across contexts spanning millions of tokens. The context window of MiniMax-Text-01 can reach up to 1 million tokens during training and extrapolate to 4 million tokens during inference at an affordable cost. Our vision-language model, MiniMax-VL-01, is built through continued training with 512 billion vision-language tokens. Experiments on both standard and in-house benchmarks show that our models match the performance of state-of-the-art models such as GPT-4o and Claude-3.5-Sonnet while offering a 20-32 times longer context window. We publicly release MiniMax-01 at https://github.com/MiniMax-AI.
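
To illustrate the mechanism the abstract names, the sketch below is a minimal NumPy rendering of the causal linear-attention recurrence that lightning-attention-style kernels compute in linear time with respect to sequence length. It is a simplified assumption-based toy (no feature maps, normalization, decay terms, or block tiling) and is not the MiniMax-01 implementation; the function name and shapes are illustrative only.

```python
import numpy as np

def causal_linear_attention(Q, K, V):
    """Toy causal linear-attention recurrence.

    Lightning attention is an I/O-aware, block-tiled way of computing this
    kind of recurrence; this version keeps only the mathematical core.

    Q, K, V: arrays of shape (seq_len, d).
    Returns: array of shape (seq_len, d).
    """
    seq_len, d = Q.shape
    S = np.zeros((d, d))            # running sum of outer products k_t v_t^T
    out = np.empty_like(V)
    for t in range(seq_len):
        S += np.outer(K[t], V[t])   # O(d^2) state update per token
        out[t] = Q[t] @ S           # o_t = q_t^T S_t; no seq_len^2 term
    return out

# Usage: per-token cost is independent of sequence length, which is what
# makes million-token contexts tractable compared to softmax attention.
Q = np.random.randn(8, 4)
K = np.random.randn(8, 4)
V = np.random.randn(8, 4)
print(causal_linear_attention(Q, K, V).shape)  # (8, 4)
```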
