Update descriptions of 3W Dataset's structure, now based on Parquet files
ricardoevvargas committed Jul 26, 2024
1 parent e96b2bb commit 5adc6e8
Showing 3 changed files with 10 additions and 5 deletions.
9 changes: 7 additions & 2 deletions 3W_DATASET_STRUCTURE.md
@@ -1,6 +1,11 @@
-The 3W Dataset consists of multiple CSV files saved in the [dataset](dataset) directory and structured as follows.
+The 3W Dataset consists of multiple Parquet files saved in the [dataset](dataset) directory and structured as follows.

There are two types of subdirectory:

* The [folds](dataset/folds) subdirectory holds all 3W Dataset configuration files. For each specific project released in the 3W Project there will be a file that will specify how and which data must be loaded for training and testing in multiple folds of experimentation. This scheme allows implementation of cross validation and hyperparameter optimization by the 3W Toolkit users. In addition, this scheme allows the user to choose some specific characteristics to the desired experiment. For example: whether or not simulated and/or hand-drawn intances should be considered in the training set. It is important to clarify that specifying which instances make up which folds will always be random but fixed in each configuration file. This is considered necessary so that results obtained for the same problem with different approaches can be compared;
-* The other subdirectories holds all 3W Dataset data files. The subdirectory names are the instances' labels. Each file represents one instance. The filename reveals its source. All files are standardized as follow. There are one observation per line and one series per column. Columns are separated by commas and decimals are separated by periods. The first column contains timestamps, the last one reveals the observations' labels, and the other columns are the Multivariate Time Series (MTS) (i.e. the instance itself).
+* The other subdirectories holds all 3W Dataset data files. The subdirectory names are the instances' labels. Each file represents one instance. The filename reveals its source. All files are standardized as follows:
+    * All Parquet files are created and read with pandas functions, `pyarrow` engine and `brotli` compression;
+    * For each instance, timestamps corresponding to observations are stored in Parquet file as its index and loaded into pandas DataFrame as its index;
+    * Each observation is stored in a line of a Parquet file and loaded as a line of a pandas DataFrame;
+    * All variables are stored as float in columns of Parquet files and loaded as float in columns of pandas DataFrame;
+    * All labels are stored as `Int64` (not `int64`) in columns of Parquet files and loaded as `Int64` (not `int64`) in columns of pandas DataFrame.
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -83,7 +83,7 @@ We expect to receive contributions at different levels, as shown in the figure b

## 3W Dataset's structure

-At level 1, the 3W Dataset consists of all CSV files in the subdirectories of the [dataset](dataset) directory and structured as detailed [here](3W_DATASET_STRUCTURE.md).
+At level 1, the 3W Dataset consists of multiple Parquet files saved in subdirectories of the [dataset](dataset) directory and structured as detailed [here](3W_DATASET_STRUCTURE.md).

## 3W Toolkit's structure

4 changes: 2 additions & 2 deletions README.md
@@ -85,7 +85,7 @@ It is also very important to know, participate and follow the discussions. See t

## Licenses

-All the code of this project is licensed under the [Apache 2.0 License][apache] and all 3W Dataset data files (CSV files in the subdirectories of the [dataset](dataset) directory) are licensed under the [Creative Commons Attribution 4.0 International License][cc-by].
+All the code of this project is licensed under the [Apache 2.0 License][apache] and all 3W Dataset's data files (Parquet files saved in subdirectories of the [dataset](dataset) directory) are licensed under the [Creative Commons Attribution 4.0 International License][cc-by].

## Versioning

@@ -111,7 +111,7 @@ To the best of its authors' knowledge, this is the first realistic and public da

## Structure

-The 3W Dataset consists of all CSV files in the subdirectories of the [dataset](dataset) directory and structured as detailed [here](3W_DATASET_STRUCTURE.md).
+The 3W Dataset consists of multiple Parquet files saved in subdirectories of the [dataset](dataset) directory and structured as detailed [here](3W_DATASET_STRUCTURE.md).

## Overview
