This code provides a tool to generate synthetic time series using some of the most common techniques. The sources of the code used are listed in the Sources section below.
This repository only contains the sharable (public) part of the code. Since the concatenation tool used to reconstruct the results of the GANs is absent, the code might not work properly with GAN-based generation.
Prerequisites | Build | Execution | Results | Examples | Plot metrics | BasicGAN 3072 Note | Sources | Datasets form
*How to install python3.6 for Ubuntu 20.04
- Install the needed libraries and create the virtual environments (takes several minutes)
$ sudo sh install_all.sh
- Change the parameters in parameters.json if needed, then run with
$ python3.6 main.py
| --algorithm | --dataset | --nb_epochs | --batch_size | --TimeGAN_seq_len | --Kalman_filter | --Compute_metrics | --Show_plot |
|---|---|---|---|---|---|---|---|
| DBA | 'datasets/Original_Data/BeetleFly_TEST.csv' | int > 0 | int > 0 | int > 0 | 1 (-> apply) | 1 (-> compute) | 1 (-> show) |
| InfoGAN | 'datasets/Original_Data/Coffee_TEST.csv' | | | | 0 (-> do not apply) | 0 (-> do not compute) | 0 (-> do not show) |
| TimeGAN | 'datasets/Original_Data/Ham_TEST.csv' | | | | | | |
| AnomaliesInjection | 'datasets/Original_Data/Lighting7_TEST.csv' | | | | | | |
| AR | 'datasets/Original_Data/Alabama_weather_6k_8k.csv' | | | | | | |
| | 'datasets/Original_Data/Currency2.csv' | | | | | | |
For each new dataset, a separate folder is created inside the 'results' folder. Within it, the files 'precision.csv' and 'runtime.csv' group the statistics for all the generation techniques used with the given dataset, 'data' contains a csv file with the output of each technique, and 'plots' contains a png image with the plot of each technique.
Example after running each of the 5 algorithms on "BeetleFly_TEST.csv" dataset:
./results/
├── BeetleFly_TEST
│ ├── data
│ │ ├── BeetleFly_TEST_AnomaliesInjection.csv
│ │ ├── BeetleFly_TEST_AR.csv
│ │ ├── BeetleFly_TEST_DBA.csv
│ │ ├── BeetleFly_TEST_InfoGAN.csv
│ │ └── BeetleFly_TEST_TimeGAN.csv
│ ├── plots
│ │ ├── BeetleFly_TEST_AnomaliesInjection.png
│ │ ├── BeetleFly_TEST_AR.png
│ │ ├── BeetleFly_TEST_DBA.png
│ │ ├── BeetleFly_TEST_InfoGAN.png
│ │ └── BeetleFly_TEST_TimeGAN.png
│ ├── precision.csv
│ └── runtime.csv
└── placeholder.txt
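Once a run has finished, the summary files can also be inspected programmatically. The snippet below is only a minimal sketch: it assumes nothing beyond 'precision.csv' and 'runtime.csv' being plain comma-separated files with a header row (the exact column names depend on which algorithms were run):

```python
import pandas as pd

# Path of one result folder (the BeetleFly_TEST example above).
result_dir = "results/BeetleFly_TEST"

# Load the two summary files and show their content; no assumption is
# made on the column names, we simply print whatever is there.
precision = pd.read_csv(f"{result_dir}/precision.csv")
runtime = pd.read_csv(f"{result_dir}/runtime.csv")

print(precision)
print(runtime)
```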
- Generate 5 time series of length 100 using the AnomaliesInjection algorithm, with dataset "Coffee_TEST.csv" as input:
$ python3.6 main.py --dataset 'datasets/Original_Data/Coffee_TEST.csv' --algorithm AnomaliesInjection --length 100 --nb_series 5
- Run the InfoGAN algorithm on the "Currency2" dataset, for 300 epochs and with a batch of size 200. Then apply the Kalman filter and compute the metrics:
$ python3.6 main.py --dataset 'datasets/Original_Data/Currency2.csv' --algorithm InfoGAN --nb_epochs 300 --batch_size 200 --Kalman_filter 1 --Compute_metrics 1 --Show_plot 0
- Run both DBA and AnomaliesInjection on the "BeetleFly_TEST" dataset. For the other parameters, use the values specified in parameters.json:
$ python3.6 main.py --dataset 'datasets/Original_Data/BeetleFly_TEST.csv' --algorithm DBA AnomaliesInjection
The parameters for the synthetic data generation are stored in the file parameters.json, in the main folder (Time_Series_Generation_Benchmark). This file can therefore be modified to adapt the generation. Some parameters, for example those for AnomaliesInjection, can only be set through this file and not directly when the code is run.
(Note that the tsgen implementation also uses its own parameters.json file, but it is partially overwritten when the code is run, so modifying it is usually not useful.)
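When the generation has to be scripted, parameters.json can also be edited programmatically instead of by hand. The snippet below is only a sketch: it assumes nothing beyond the key names documented in the tables that follow (here AR_lag_window) and leaves the rest of the file untouched:

```python
import json

# Read the current parameters from the main folder.
with open("parameters.json") as f:
    params = json.load(f)

# AR_lag_window is documented below; 0 means "use the default"
# (1/4 of the time series length).
params["AR_lag_window"] = 10

# Write the file back, keeping every other parameter unchanged.
with open("parameters.json", "w") as f:
    json.dump(params, f, indent=4)
```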
Parameters specific for AutoregressiveModel:
Parameter name | Usage | Possible values |
---|---|---|
AR_lag_window | The lag window to use with the AR model | Integer number > 0, 0 for default (1/4 of the ts length) |
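To make the role of the lag window concrete, here is a standalone autoregressive sketch built with statsmodels. It is not the code used by this repository; the toy data, the model class, and the forecast horizon are illustrative only:

```python
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

# Toy input series; in the benchmark this would be one ts of the dataset.
rng = np.random.default_rng(0)
ts = np.cumsum(rng.normal(size=200))

# Default behaviour when AR_lag_window is 0: use 1/4 of the ts length.
lag_window = len(ts) // 4

# Fit an AR model that predicts each point from the previous
# `lag_window` points, then generate new values by forecasting
# past the end of the original series.
model = AutoReg(ts, lags=lag_window).fit()
synthetic = model.predict(start=len(ts), end=len(ts) + 99)
print(synthetic[:5])
```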
Parameters specific for Kalman filter:
Parameter name | Usage | Possible values |
---|---|---|
Kalman_remove_initial | Number of points to remove from the start of the time series after applying filter | Integer number >= 0 |
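The reason for this parameter is that a Kalman filter needs a few observations before its estimate converges, so the first points of the filtered series are usually unreliable. The following hand-written scalar filter is shown only to illustrate the idea; it is not the filter implemented by the tool, and the noise parameters are made up:

```python
import numpy as np

def kalman_smooth(series, process_var=1e-4, measurement_var=1e-2):
    """Very simple 1-D Kalman filter with a constant-level state model."""
    estimate, variance = series[0], 1.0
    filtered = []
    for z in series:
        variance += process_var                        # predict: uncertainty grows
        gain = variance / (variance + measurement_var)
        estimate += gain * (z - estimate)              # update: blend with measurement
        variance *= (1.0 - gain)
        filtered.append(estimate)
    return np.array(filtered)

noisy = np.sin(np.linspace(0, 6, 300)) + np.random.normal(0, 0.1, 300)
filtered = kalman_smooth(noisy)

Kalman_remove_initial = 20        # drop the burn-in points, as the parameter does
filtered = filtered[Kalman_remove_initial:]
```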
Parameters specific for AnomaliesInjection:
Parameter name | Usage | Possible values |
---|---|---|
AnomaliesInjection_nb_modifications | The number of anomalies to insert in the dataset | Integer number > 0, or -1 to use the default value (on average, 1 anomaly per ts in the dataset) |
AnomaliesInjection_multiple_modification_per_ts | Determines if more than 1 anomaly can be inserted in the same ts | 1 for True, 0 for False. If 0, AnomaliesInjection_nb_modifications should be smaller than the number of ts in the dataset |
AnomaliesInjection_seed | Seed for the random generation | Integer number > 0 |
AnomaliesInjection_max_nb_extreme | Maximal number of extreme points (spikes) for each extreme anomaly | Integer number > 0 |
AnomaliesInjection_min/max_shift/trend/variance | The minimal/maximal length of each shift/trend/variance anomaly | Integer number > 0. For each anomaly type, the max value should be strictly bigger than the min value |
AnomaliesInjection_extreme/shift/trend/variance_factor | The "intensity" of each extreme/shift/trend/variance anomaly | Integer number > 0 |
AnomaliesInjection_probability_extreme/shift/trend/variance | The probability for each anomaly to be of type extreme/shift/trend/variance (see the sketch after this table) | Float number >= 0. If the sum of the 4 probabilities is not 1, they are rescaled to ensure this property. If they are all 0, the default probabilities (0.25 each) are used |
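The rescaling rule for the four probabilities is sketched below, re-implemented here only for illustration (this is not the repository's actual code, and the numbers are made up):

```python
import numpy as np

# Values as they could appear in parameters.json (illustrative numbers).
probs = {"extreme": 2.0, "shift": 1.0, "trend": 1.0, "variance": 0.0}

total = sum(probs.values())
if total == 0:
    # All four are 0: fall back to the default of 0.25 each.
    probs = {k: 0.25 for k in probs}
else:
    # Rescale so that the four probabilities sum to 1.
    probs = {k: v / total for k, v in probs.items()}

# Draw the type of the next anomaly according to these probabilities.
anomaly_type = np.random.choice(list(probs), p=list(probs.values()))
print(probs, anomaly_type)
```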
In an "extreme" anomaly, a point is modified to have a much bigger/smaller value that the original one, thus resulting in a spike when the time series is plotted.
In a "shift" anomaly, all the the records in a given interval are shifted by a given value, which is equal in every point. The result is that a part of the time series is shifted up or down.
In a "trend" anomaly, a trend is inserted at a given point in the time series. In other words, an increasing (or decreasing) sequence of values is added to a portion of the time series. For example, a time series [1,1,1,1,1,1,1,1,1,1] might become [1,1,1,2,3,4,4,4,4,4]. Notice that after the trend part ([2,3,4]), all the values are modified in order to continue "directly" from the last point (in this example, they are all increased by 3).
In a "variance" anomaly, the variance of a random interval is increased. Visually, this results in something similar to a "vibration".
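As an illustration of the "trend" anomaly described above, the sketch below reproduces the [1, ..., 1] example. It is a simplified re-implementation, not the agots-based code used by the tool:

```python
import numpy as np

def inject_trend(ts, start, length, step):
    """Add an increasing ramp of `length` points at position `start`, then
    shift everything after it so the series continues from the ramp's end."""
    ts = np.asarray(ts, dtype=float).copy()
    ramp = step * np.arange(1, length + 1)   # e.g. [1, 2, 3]
    ts[start:start + length] += ramp         # the trend itself
    ts[start + length:] += ramp[-1]          # keep continuity afterwards
    return ts

print(inject_trend([1] * 10, start=3, length=3, step=1))
# -> [1. 1. 1. 2. 3. 4. 4. 4. 4. 4.]
```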
Once the data has been generated, plot metrics with:
$ python3.6 plot_metrics.py
It is possible to select only specific datasets by indicating their result folder:
$ python3.6 plot_metrics.py --dataset results/BeetleFly_TEST results/Coffee_TEST results/Currency2
The code provides an implementation of BasicGAN as well. However, it does not work with any of the included datasets, as it requires a dataset with 3072 time series of length 3072.
It has been included for consistency with the paper, but its runtime is not directly measured and the metrics must be extracted manually from the output.
The codes used in this repo are adapted versions of:
- Exascale tsgen -> InfoGAN
- jsyoon0823 TimeGAN -> TimeGAN
- KDD_OpenSource agots -> AnomaliesInjection
- Generating synthetic time series to augment sparse datasets -> DBA
- Machine Learning Mastery -> AR
- dbiir TS-Benchmark -> BasicGAN3072
col_name1, col_name2, col_name3, ..., col_namej
val11, val12, val13, ..., val1j
val21, val22, val23, ..., val2j
..., ..., ..., ..., ...
vali1, vali2, vali3, ..., valij
In particular, the values should be separated by a ',', and each of the j time series should have the same length.
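A quick way to check that a new dataset respects this format is sketched below, assuming only pandas and using one of the repository's files as an example path:

```python
import pandas as pd

# Each column is one time series; the header row gives the column names.
df = pd.read_csv("datasets/Original_Data/Coffee_TEST.csv", sep=",")

# All j time series must have the same length, i.e. no missing values.
assert not df.isna().any().any(), "the time series do not all have the same length"
print(f"{df.shape[1]} time series of length {df.shape[0]}")
```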