MONSTER: Monash Scalable Time Series Evaluation Repository
Abstract
We introduce MONSTER (the MONash Scalable Time Series Evaluation Repository), a collection of large datasets for time series classification. The field of time series classification has benefitted from common benchmarks set by the UCR and UEA time series classification repositories. However, the datasets in these benchmarks are small, with median sizes of 217 and 255 examples, respectively. In consequence, these benchmarks favour a narrow subspace of models optimised to achieve low classification error across a wide variety of small datasets, that is, models that minimise variance and give little weight to computational issues such as scalability. Our hope is to diversify the field by introducing benchmarks using larger datasets. We believe there is enormous potential for new progress in the field by engaging with the theoretical and practical challenges of learning effectively from larger quantities of data.
Summary
Paper Overview
Core Contribution
- Introduces MONSTER (MONash Scalable Time Series Evaluation Repository), a collection of large datasets for time series classification.
- Aims to diversify the field by introducing benchmarks using larger datasets, addressing the limitations of existing small datasets in the UCR and UEA archives.
- Encourages engagement with theoretical and practical challenges of learning from larger quantities of data.
Research Context
- Current benchmarks in time series classification are dominated by small datasets, favoring models optimized for low classification error on small datasets.
- Larger datasets are needed to reflect real-world challenges and enable the use of low-bias models like deep neural networks.
- MONSTER complements existing datasets and aims to inspire progress in the field by focusing on scalability and computational efficiency.
Keywords
- Time series classification
- Dataset
- Benchmark
- Bitter lesson
- Scalability
Background
Research Gap
- Existing benchmarks (UCR and UEA archives) focus on small datasets, limiting the exploration of models that perform well on larger datasets.
- Lack of large-scale datasets hinders the development of models that can handle real-world, large-scale time series classification tasks.
Technical Challenges
- On small datasets, minimizing variance takes priority, which favors higher-bias models over low-bias learners such as deep neural networks.
- Computational scalability becomes a critical issue when dealing with larger datasets.
- Current benchmarks do not reflect the challenges of learning from large-scale real-world data.
Prior Approaches
- UCR and UEA archives have been foundational for time series classification but are limited by small dataset sizes.
- Deep learning methods have had limited impact due to insufficient training data in existing benchmarks.
- Methods like InceptionTime, HIVE-COTE, and Rocket have been optimized for small datasets.
Methodology
Technical Architecture
- MONSTER includes 29 univariate and multivariate datasets with sizes ranging from 10,299 to 59,268,823 time series.
- Datasets are categorized into audio, satellite, EEG, human activity recognition (HAR), count, and other domains.
- Data is provided in .npy and .csv formats, with 5-fold cross-validation indices for consistent evaluation (a loading sketch follows this list).
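As an illustration of how a dataset in this layout could be consumed, the sketch below loads the arrays and splits them on a saved fold index. The file names, array shapes, and fold-file format are assumptions made for illustration, not the repository's documented layout.

```python
import numpy as np

# Hypothetical file names and shapes, for illustration only; the actual
# MONSTER file layout may differ.
X = np.load("Pedestrian_X.npy")              # (n_examples, n_channels, length)
y = np.load("Pedestrian_y.npy")              # (n_examples,) integer class labels
test_idx = np.load("Pedestrian_fold_0.npy")  # test indices for fold 0

# Build a boolean mask so everything outside the fold becomes training data.
test_mask = np.zeros(len(y), dtype=bool)
test_mask[test_idx] = True

X_train, y_train = X[~test_mask], y[~test_mask]
X_test, y_test = X[test_mask], y[test_mask]
```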
Implementation Details
- Datasets are processed to ensure consistency, including interpolation, resampling, and labeling.
- Cross-validation folds are stratified or based on metadata (e.g., geographic location, experimental subjects); see the fold-construction sketch after this list.
- Baseline results are provided for several models, including deep learning and traditional methods.
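A minimal sketch of how such folds could be generated with scikit-learn is shown below; the file names and the per-example metadata array are hypothetical, and the paper's actual fold construction may differ.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold

# Hypothetical inputs: class labels and per-example metadata (e.g. subject IDs).
y = np.load("Pedestrian_y.npy")
groups = np.load("Pedestrian_subject.npy")
X_placeholder = np.zeros((len(y), 1))  # fold construction only needs the sample count

# Stratified folds: each fold preserves the overall class distribution.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
stratified_folds = [test for _, test in skf.split(X_placeholder, y)]

# Metadata-based folds: all examples sharing a group (e.g. the same experimental
# subject or geographic location) end up in a single fold.
gkf = GroupKFold(n_splits=5)
group_folds = [test for _, test in gkf.split(X_placeholder, y, groups)]
```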
Innovation Points
- Introduction of large-scale datasets to address the limitations of small datasets in current benchmarks.
- Encourages the use of low-bias models and explores the challenges of scalability and computational efficiency.
- Provides a new benchmark that better reflects real-world time series classification tasks.
Results
Experimental Setup
- Baseline models include ConvTran, FCN, HInceptionTime, TempCNN, Hydra, Quant, and Extremely Randomised Trees (ET).
- Models are evaluated using 5-fold cross-validation, with metrics including 0–1 loss, log loss, and training time (a minimal evaluation sketch follows this list).
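The sketch below shows a minimal per-fold evaluation loop for one of the baselines (Extremely Randomised Trees, via scikit-learn) reporting the same three quantities. It is an illustrative stand-in under simple assumptions, not the authors' evaluation pipeline.

```python
import time

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import log_loss, zero_one_loss

def evaluate_fold(X_train, y_train, X_test, y_test):
    """Fit one baseline on one fold and report 0-1 loss, log loss, and train time."""
    # Flatten each (channels, length) series into a fixed-length feature vector
    # so the tree ensemble can consume it directly.
    X_train = X_train.reshape(len(X_train), -1)
    X_test = X_test.reshape(len(X_test), -1)

    model = ExtraTreesClassifier(n_estimators=100, n_jobs=-1)

    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_time = time.perf_counter() - start

    probabilities = model.predict_proba(X_test)
    return {
        "0-1 loss": zero_one_loss(y_test, model.predict(X_test)),
        "log loss": log_loss(y_test, probabilities, labels=model.classes_),
        "train time (s)": train_time,
    }
```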
Key Findings
- Quant achieves the lowest overall mean 0–1 loss, followed by ConvTran and HInceptionTime.
- ConvTran and HInceptionTime perform well on audio datasets, while ET and Quant perform well on count datasets.
- Hydra is the fastest method in terms of training time, while HInceptionTime is the slowest.
- FCN and TempCNN struggle with audio datasets due to their small receptive fields (see the receptive-field arithmetic below).
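For intuition on why a small receptive field hurts on audio: with stride-1, undilated convolutions, the receptive field grows only additively in the kernel sizes. Assuming the standard FCN configuration for time series (three convolutional layers with kernel sizes 8, 5, and 3, an assumption about the baseline's exact architecture), each convolutional feature sees only

$$R = 1 + \sum_i (k_i - 1) = 1 + 7 + 4 + 2 = 14$$

time steps, which is very small relative to the length of a typical audio recording.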
Limitations
- Some models (e.g., FCN, TempCNN) perform poorly on certain datasets, particularly audio datasets.
- Training time and computational resources are significant challenges for larger datasets.
- The benchmark is still in its initial release, with plans to add more datasets in the future.