Sustainable Industry: Rinse Over Run data competition.
We settled on a gradient boosted tree to solve the problem. We chose CatBoost's implementation of the algorithm, as we have found that it generally provides better results than XGBoost.
Put raw data files in data/raw:
recipe_metadata.csv
submission_format.csv
test_values.zip
train_labels.csv
train_values.zip
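Before starting the pipeline, it can help to confirm that all five files are in place. The following is a minimal sketch, not part of the original repository: the helper name missing_raw_files is hypothetical, and it assumes the script is run from the repository root so that data/raw resolves correctly.

```python
from pathlib import Path

# File names are taken from the list above.
EXPECTED_RAW_FILES = [
    "recipe_metadata.csv",
    "submission_format.csv",
    "test_values.zip",
    "train_labels.csv",
    "train_values.zip",
]


def missing_raw_files(raw_dir="data/raw"):
    """Return the expected raw files that are not present in raw_dir.

    Hypothetical helper: raw_dir is assumed to be relative to the repo root.
    """
    raw_path = Path(raw_dir)
    return [name for name in EXPECTED_RAW_FILES if not (raw_path / name).is_file()]


if __name__ == "__main__":
    missing = missing_raw_files()
    if missing:
        print("Missing from data/raw:", ", ".join(missing))
    else:
        print("All raw data files are in place.")
```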
Install Docker, then pull our Docker image: docker pull contiamo/schneider
Then run the following notebooks in order:
data_processing.ipynb (train/test split and truncation of phases)
feature_engineering.ipynb (calculation of timeseries features)
catboost/best_model.ipynb (training the model)
Specifically:
docker run --rm -e CHOWN_HOME=yes -v "$PWD":/home/jovyan/work contiamo/schneider papermill /home/jovyan/work/notebooks/data_processing.ipynb /home/jovyan/work/notebooks/data_processing.output.ipynb
docker run --rm -e CHOWN_HOME=yes -v "$PWD":/home/jovyan/work contiamo/schneider papermill /home/jovyan/work/notebooks/feature_engineering.ipynb /home/jovyan/work/notebooks/feature_engineering.output.ipynb
docker run --rm -e CHOWN_HOME=yes -v "$PWD":/home/jovyan/work contiamo/schneider papermill /home/jovyan/work/notebooks/catboost/best_model.ipynb /home/jovyan/work/notebooks/catboost/best_model.output.ipynb
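The three commands above differ only in the notebook path, so they can be generated in a loop. This is a minimal sketch that rebuilds the same argument lists; the function name papermill_command is hypothetical, and the sketch only prints the commands rather than running them.

```python
import os
import shlex

# Image name, mount point and notebook paths are copied from the commands above.
IMAGE = "contiamo/schneider"
WORK_DIR = "/home/jovyan/work"
NOTEBOOKS = [
    "notebooks/data_processing.ipynb",
    "notebooks/feature_engineering.ipynb",
    "notebooks/catboost/best_model.ipynb",
]


def papermill_command(notebook, host_dir):
    """Build the docker/papermill command for one notebook as an argument list."""
    source = f"{WORK_DIR}/{notebook}"
    # The executed copy gets the .output.ipynb suffix, as in the commands above.
    output = source[: -len(".ipynb")] + ".output.ipynb"
    return [
        "docker", "run", "--rm",
        "-e", "CHOWN_HOME=yes",
        "-v", f"{host_dir}:{WORK_DIR}",
        IMAGE, "papermill", source, output,
    ]


if __name__ == "__main__":
    # Print the commands in pipeline order; host_dir plays the role of "$PWD".
    for nb in NOTEBOOKS:
        print(shlex.join(papermill_command(nb, os.getcwd())))
```

To execute the steps instead of printing them, each list could be passed to subprocess.run(cmd, check=True), which stops the pipeline at the first failing notebook.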
The resulting submission will be in data/.
In order to run the notebooks interactively:
docker run --rm -p 127.0.0.1:8888:8888 -e CHOWN_HOME=yes -v "$PWD":/home/jovyan/work contiamo/schneider jupyter lab --NotebookApp.token=''
Then open your browser at this address: http://localhost:8888. If the port is already in use, change it in the -p flag above.
Running interactively is particularly useful for the last notebook, which trains the model.
Note: In order to visualize progress during model training, the notebook has to be viewed in the "classic" view: Help > Launch Classic Notebook.
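If port 8888 is taken on the host, a free one can be found programmatically before starting the container. This is a minimal sketch using only the standard library; the helper names are hypothetical, and it assumes the container side keeps listening on 8888 as in the command above.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if we can bind host:port, i.e. nothing else holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def first_free_port(start=8888, attempts=20):
    """Return the first free port in [start, start + attempts)."""
    for port in range(start, start + attempts):
        if port_is_free(port):
            return port
    raise RuntimeError(f"no free port found in {start}..{start + attempts - 1}")


def publish_flag(host_port, container_port=8888):
    """Build the value for docker's -p flag, bound to localhost only."""
    return f"127.0.0.1:{host_port}:{container_port}"


if __name__ == "__main__":
    port = first_free_port()
    print("-p", publish_flag(port))
    print(f"Then open http://localhost:{port}")
```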
We split the original data into the following datasets, available as parquet files:
train_ts_truncated: 70% of original train data, with final_rinse removed, and truncated to match phase distribution in hidden test set;
test_ts_truncated: 30% of original train data, with final_rinse removed, and truncated to match phase distribution in hidden test set;
train_target: target value for processes in train_ts_truncated;
test_target: target value for processes in test_ts_truncated.
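The split described above can be illustrated with a small standard-library sketch. It is not the code from data_processing.ipynb: it assumes the split is done per process, that each row carries process_id and phase fields (assumed column names), and the function names and seed are hypothetical. The truncation step is left out because the phase distribution of the hidden test set is not stated here.

```python
import random


def split_process_ids(process_ids, train_fraction=0.7, seed=0):
    """Shuffle the unique process ids and split them into train and test sets.

    Splitting by process (an assumption) keeps all rows of one process on the
    same side of the split.
    """
    unique_ids = sorted(set(process_ids))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(unique_ids)
    cut = int(round(len(unique_ids) * train_fraction))
    return set(unique_ids[:cut]), set(unique_ids[cut:])


def drop_final_rinse(rows):
    """Remove rows from the final_rinse phase, which is not used as model input."""
    return [row for row in rows if row["phase"] != "final_rinse"]


if __name__ == "__main__":
    # Toy rows for illustration only, not real competition data.
    rows = [
        {"process_id": pid, "phase": phase}
        for pid in range(10)
        for phase in ("caustic", "final_rinse")
    ]
    train_ids, test_ids = split_process_ids(row["process_id"] for row in rows)
    usable = drop_final_rinse(rows)
    train_rows = [row for row in usable if row["process_id"] in train_ids]
    test_rows = [row for row in usable if row["process_id"] in test_ids]
    print(len(train_rows), "train rows,", len(test_rows), "test rows")
```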