Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts
September 24, 2024
Authors: Xiaoming Shi, Shiyu Wang, Yuqi Nie, Dianqi Li, Zhou Ye, Qingsong Wen, Ming Jin
cs.AI
Abstract
Deep learning for time series forecasting has seen significant advancements
over the past decades. However, despite the success of large-scale pre-training
in language and vision domains, pre-trained time series models remain limited
in scale and operate at a high cost, hindering the development of larger,
more capable forecasting models in real-world applications. In response, we
introduce Time-MoE, a scalable and unified architecture designed to pre-train
larger, more capable forecasting foundation models while reducing inference
costs. By leveraging a sparse mixture-of-experts (MoE) design, Time-MoE
enhances computational efficiency by activating only a subset of networks for
each prediction, reducing computational load while maintaining high model
capacity. This allows Time-MoE to scale effectively without a corresponding
increase in inference costs. Time-MoE comprises a family of decoder-only
transformer models that operate in an auto-regressive manner and support
flexible forecasting horizons with varying input context lengths. We
pre-trained these models on our newly introduced large-scale dataset Time-300B,
which spans 9 domains and encompasses over 300 billion time points. For
the first time, we scaled a time series foundation model up to 2.4 billion
parameters, achieving significantly improved forecasting precision. Our results
validate the applicability of scaling laws for training tokens and model size
in the context of time series forecasting. Compared to dense models with the
same number of activated parameters or equivalent computation budgets, our
models consistently outperform them by a large margin. These advancements
position Time-MoE as a state-of-the-art solution for tackling real-world time
series forecasting challenges with superior capability, efficiency, and
flexibility.
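The abstract's central efficiency claim is that only a subset of expert networks is activated for each prediction. As a rough illustration only (not the authors' implementation; the class name, layer sizes, and top-k routing details below are assumptions), a sparse mixture-of-experts feed-forward layer with top-k gating could look like this PyTorch sketch:

```python
# Hypothetical sketch of a sparse top-k MoE feed-forward layer: each token is
# routed to only `top_k` of `num_experts` experts, so per-token compute stays
# roughly constant even as the total parameter count grows with more experts.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)  # router producing expert scores
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); flatten to route each token independently
        b, t, d = x.shape
        tokens = x.reshape(-1, d)
        scores = self.gate(tokens)                       # (b*t, num_experts)
        top_w, top_idx = scores.topk(self.top_k, dim=-1)
        top_w = F.softmax(top_w, dim=-1)                 # weights over the chosen experts

        out = torch.zeros_like(tokens)
        for slot in range(self.top_k):
            idx = top_idx[:, slot]
            w = top_w[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():                           # run each expert only on its tokens
                    out[mask] += w[mask] * expert(tokens[mask])
        return out.reshape(b, t, d)


if __name__ == "__main__":
    layer = SparseMoELayer(d_model=64, d_ff=256)
    series_embeddings = torch.randn(4, 32, 64)           # (batch, time steps, d_model)
    print(layer(series_embeddings).shape)                 # torch.Size([4, 32, 64])
```

Because only `top_k` experts run per token, adding experts increases total capacity without a matching increase in inference cost, which is the scaling property the abstract emphasizes.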