Population-Aware Diffusion for Time Series Generation

Abstract

Diffusion models have shown promising ability in generating high-quality time series (TS) data. Despite this initial success, existing work focuses mostly on the authenticity of data at the individual level and pays less attention to preserving population-level properties across the entire dataset. Such population-level properties include the value distribution of each dimension and the distributions of certain functional dependencies (e.g., cross-correlation, CC) between different dimensions. For instance, when generating house energy consumption TS data, the value distributions of the outside temperature and the kitchen temperature should be preserved, as should the distribution of the CC between them. Preserving such TS population-level properties is critical for maintaining the statistical insights of a dataset, mitigating model bias, and augmenting downstream tasks such as TS prediction. Yet existing models often overlook it, so the data they generate often exhibits distribution shifts from the original data. We propose Population-aware Diffusion for Time Series (PaD-TS), a new TS generation model that better preserves population-level properties. The key novelties of PaD-TS are 1) a new training method that explicitly incorporates TS population-level property preservation, and 2) a new dual-channel encoder model architecture that better captures the TS data structure. Empirical results on major benchmark datasets show that PaD-TS can improve the average CC distribution shift score between real and synthetic data by 5.9x while maintaining performance comparable to state-of-the-art models on individual-level authenticity.
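To make the idea of a population-level CC distribution concrete, the following is a minimal sketch, not the method used by PaD-TS. It assumes that CC means the Pearson correlation between two dimensions of one series, and it uses a hypothetical shift score (the mean L1 distance between histograms of CC values); the abstract does not define the actual score. The names `pairwise_cc` and `cc_shift` are introduced here for illustration only.

```python
import numpy as np


def pairwise_cc(samples: np.ndarray) -> np.ndarray:
    """Pearson correlation of every dimension pair, computed per series.

    samples has shape (n_series, seq_len, n_dims); the result has shape
    (n_series, n_pairs), one column per unordered pair of dimensions.
    """
    _, seq_len, n_dims = samples.shape
    centered = samples - samples.mean(axis=1, keepdims=True)
    std = centered.std(axis=1, keepdims=True)
    # Constant dimensions have zero std; dividing by 1 leaves their CC at 0.
    normed = centered / np.where(std > 0, std, 1.0)
    corr = np.einsum("ntd,nte->nde", normed, normed) / seq_len
    corr = np.clip(corr, -1.0, 1.0)  # guard against floating-point overshoot
    i, j = np.triu_indices(n_dims, k=1)
    return corr[:, i, j]


def cc_shift(real: np.ndarray, synth: np.ndarray, bins: int = 20) -> float:
    """Hypothetical shift score: mean L1 distance between CC histograms.

    This is an assumed stand-in, not the score reported in the abstract.
    """
    edges = np.linspace(-1.0, 1.0, bins + 1)
    cc_real, cc_synth = pairwise_cc(real), pairwise_cc(synth)
    distances = []
    for pair in range(cc_real.shape[1]):
        h_real, _ = np.histogram(cc_real[:, pair], bins=edges)
        h_synth, _ = np.histogram(cc_synth[:, pair], bins=edges)
        p_real = h_real / h_real.sum()
        p_synth = h_synth / h_synth.sum()
        distances.append(np.abs(p_real - p_synth).sum())
    return float(np.mean(distances))
```

Under these assumptions, each series contributes one CC value per dimension pair, and the score compares how those values are distributed across the real and synthetic datasets rather than how realistic any single series looks. Identical datasets give a score of zero, and the score grows as the two CC distributions diverge.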
