
MiniMax-01: Scaling Foundation Models with Lightning Attention

January 14, 2025
Authors: MiniMax, Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, Enwei Jiao, Gengxin Li, Guojun Zhang, Haohai Sun, Houze Dong, Jiadai Zhu, Jiaqi Zhuang, Jiayuan Song, Jin Zhu, Jingtao Han, Jingyang Li, Junbin Xie, Junhao Xu, Junjie Yan, Kaishun Zhang, Kecheng Xiao, Kexi Kang, Le Han, Leyang Wang, Lianfei Yu, Liheng Feng, Lin Zheng, Linbo Chai, Long Xing, Meizhi Ju, Mingyuan Chi, Mozhi Zhang, Peikai Huang, Pengcheng Niu, Pengfei Li, Pengyu Zhao, Qi Yang, Qidi Xu, Qiexiang Wang, Qin Wang, Qiuhui Li, Ruitao Leng, Shengmin Shi, Shuqi Yu, Sichen Li, Songquan Zhu, Tao Huang, Tianrun Liang, Weigao Sun, Weixuan Sun, Weiyu Cheng, Wenkai Li, Xiangjun Song, Xiao Su, Xiaodong Han, Xinjie Zhang, Xinzhu Hou, Xu Min, Xun Zou, Xuyang Shen, Yan Gong, Yingjie Zhu, Yipeng Zhou, Yiran Zhong, Yongyi Hu, Yuanxiang Fan, Yue Yu, Yufeng Yang, Yuhao Li, Yunan Huang, Yunji Li, Yunpeng Huang, Yunzhi Xu, Yuxin Mao, Zehan Li, Zekang Li, Zewei Tao, Zewen Ying, Zhaoyang Cong, Zhen Qin, Zhenhua Fan, Zhihang Yu, Zhuo Jiang, Zijia Wu
cs.AI

Abstract

We introduce the MiniMax-01 series, including MiniMax-Text-01 and MiniMax-VL-01, which are comparable to top-tier models while offering superior capabilities in processing longer contexts. The core lies in lightning attention and its efficient scaling. To maximize computational capacity, we integrate it with Mixture of Experts (MoE), creating a model with 32 experts and 456 billion total parameters, of which 45.9 billion are activated for each token. We develop an optimized parallel strategy and highly efficient computation-communication overlap techniques for MoE and lightning attention. This approach enables us to conduct efficient training and inference on models with hundreds of billions of parameters across contexts spanning millions of tokens. The context window of MiniMax-Text-01 can reach up to 1 million tokens during training and extrapolate to 4 million tokens during inference at an affordable cost. Our vision-language model, MiniMax-VL-01, is built through continued training with 512 billion vision-language tokens. Experiments on both standard and in-house benchmarks show that our models match the performance of state-of-the-art models like GPT-4o and Claude-3.5-Sonnet while offering a 20-32 times longer context window. We publicly release MiniMax-01 at https://github.com/MiniMax-AI.
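The abstract attributes the long-context efficiency to lightning attention, a linear-attention variant that processes the sequence block by block: tokens attend quadratically within a block and reach earlier blocks through a compact running key-value state, so the cost grows linearly with sequence length. The sketch below is a minimal NumPy illustration of that block-wise idea only; the block size, lack of decay and normalization, and function names are illustrative assumptions, not the MiniMax-01 implementation.

```python
# Minimal sketch of block-wise causal linear attention (lightning-style).
# Illustrative only: no decay, no normalization, single head.
import numpy as np

def blockwise_linear_attention(q, k, v, block_size=64):
    """Causal linear attention, O[t] = sum_{s<=t} (q[t].k[s]) * v[s], computed block by block.

    q, k: (seq_len, d_k); v: (seq_len, d_v)
    """
    seq_len, d_k = q.shape
    d_v = v.shape[1]
    out = np.zeros((seq_len, d_v))
    kv_state = np.zeros((d_k, d_v))           # accumulates k[s]^T v[s] over past blocks
    causal = np.tril(np.ones((block_size, block_size)))

    for start in range(0, seq_len, block_size):
        end = min(start + block_size, seq_len)
        qb, kb, vb = q[start:end], k[start:end], v[start:end]
        b = end - start
        inter = qb @ kv_state                          # contribution of all earlier blocks
        intra = (qb @ kb.T * causal[:b, :b]) @ vb      # masked quadratic attention within the block
        out[start:end] = inter + intra
        kv_state += kb.T @ vb                          # fold this block into the running state
    return out

# Usage check: the block-wise result matches the naive O(n^2) causal computation.
rng = np.random.default_rng(0)
q = rng.normal(size=(256, 16)); k = rng.normal(size=(256, 16)); v = rng.normal(size=(256, 32))
naive = (np.tril(np.ones((256, 256))) * (q @ k.T)) @ v
assert np.allclose(blockwise_linear_attention(q, k, v), naive)
```

Because inference only needs the fixed-size `kv_state` rather than a cache that grows with the context, this style of attention is what makes million-token windows tractable, as claimed in the abstract.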
