The cloud-mask library processes Sentinel-2 data and runs ML algorithms at the pixel level to produce cloud masks. It is based on the only publicly available dataset of manually labelled Sentinel-2 scenes for cloud masks, curated by Hollstein et al., which gathers 97 tiles with more than 5 million manually labelled pixels. The database is provided by EnMAP and is available on the GitLab repository "Database File of Manually classified Sentinel-2A Data".
A report summarizing the findings and results is available here.
To install the package, first clone the git repository into the desired folder:
git clone [email protected]:j-desloires/s2-pixel-detection.git
Then open the Anaconda prompt and create the environment from environment.yml:
cd cloud-mask
conda env create -f environment.yml
conda activate cloud-mask
pip install .
If you want to run the Jupyter notebooks, register the environment as a kernel:
ipython kernel install --user --name=cloud-mask
This package depends on:
- the eo-flow Python module developed by Sinergise. You must clone and install this library in your environment.
- the adapt Python module developed by Michelin. The CNN method exposes the class attributes self.encoder, self.task and self.discriminator to be compatible with this library.
# Load and prepare experiments
from cloudmask.data_load import load_file, experiments
# Perform the experiments
## Load the prepared train and test dataset
from cloudmask.data_load.utils_exp import load_test_data, load_train_data, subset_data
## Hyperparameter tuning for a standard model with .fit and .predict methods
from cloudmask.base_models.base_ml_tuning import get_predictions, get_dictionary_metrics
## LGBM with focal loss for multiclass classification (one-vs-rest)
from cloudmask.ml_task.ml_models import OneVsRestLightGBMWithCustomizedLoss, FocalLoss
## CNN1D applied on the spectrum dimension
from cloudmask.ml_task import convnet_models
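To make the idea behind the one-vs-rest focal loss mentioned above concrete, here is a minimal NumPy sketch. It is not the package's `FocalLoss` or `OneVsRestLightGBMWithCustomizedLoss` implementation, whose internals are not shown here. It only evaluates the standard focal loss, which down-weights well-classified pixels through the factor (1 - p_t)^gamma, on one binary problem per class. The function names and the default values of `gamma` and `alpha` are assumptions chosen for illustration.

```python
import numpy as np


def binary_focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-9):
    """Mean binary focal loss; y_true in {0, 1}, p_pred = predicted P(y = 1)."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    # p_t is the probability the model assigns to the true class
    p_t = np.where(y_true == 1, p_pred, 1.0 - p_pred)
    # alpha_t re-weights positives versus negatives
    alpha_t = np.where(y_true == 1, alpha, 1.0 - alpha)
    # (1 - p_t) ** gamma shrinks the contribution of easy examples
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))


def one_vs_rest_focal_loss(labels, proba, n_classes, **kwargs):
    """Average the binary focal loss over one 'class k vs rest' problem per class."""
    labels = np.asarray(labels)
    proba = np.asarray(proba, dtype=float)
    losses = [
        binary_focal_loss(labels == k, proba[:, k], **kwargs)
        for k in range(n_classes)
    ]
    return float(np.mean(losses))


# Hypothetical example: 4 pixels, 3 classes, one probability column per class
labels = np.array([0, 1, 2, 1])
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.5, 0.4, 0.1],
])
print(one_vs_rest_focal_loss(labels, proba, n_classes=3))
```

With `gamma=0` and `alpha=0.5` the binary loss reduces to half the usual cross-entropy, which is a convenient sanity check when experimenting with other settings.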
The Python scripts in the folder /examples/scripts show how to call the different modules to run the experiments. They assume that you have already downloaded the data from this link into the examples/data folder of the repo. The scripts are organized to run all the experiments as follows:
- Data loading
- Load the data and group pixels per tile. All the pixels are saved into a NumPy array of shape (N, 13).
- Prepare the experiments for the ML task by splitting the data into training, validation and test sets at a given geographical scale (tile, country or continent).
- Explore the data and compute descriptive statistics here.
- Run standard ML algorithms (e.g. RandomForest) with random-search hyperparameter tuning and leave-one-group-out cross-validation here.
- Run LightGBM with focal loss to handle the imbalanced multiclass problem here.
- Run CNN1D models. You can adjust the hyperparameters through the dictionary here.
- Load results for ML algorithms. The metrics were already saved during the training phase here. We focus on LightGBM with focal loss.
- Load results for deep learning algorithms. For a given test set, we consider the model that gave the best metric on the validation sets (to be improved) here.
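The random-search step with leave-one-group-out cross-validation listed above can be sketched with scikit-learn alone. This is an illustration under stated assumptions, not the code of the package's scripts or of `base_ml_tuning`: the data are synthetic, the array shape (N, 13) follows the description above, and the three classes, the five tile identifiers and the search grid are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, RandomizedSearchCV

rng = np.random.default_rng(0)

# Synthetic stand-in for the labelled pixels: N pixels x 13 values per pixel
n_pixels, n_bands, n_tiles = 300, 13, 5
X = rng.random((n_pixels, n_bands))
y = (X[:, 0] * 3).astype(int)            # hypothetical labels in {0, 1, 2}
groups = np.arange(n_pixels) % n_tiles   # hypothetical tile id for each pixel

# Hypothetical search space; adapt it to the model being tuned
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5, 10],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,                  # number of sampled configurations
    cv=LeaveOneGroupOut(),     # each fold holds out all pixels of one tile
    scoring="f1_macro",        # macro average treats every class equally
    random_state=0,
)
search.fit(X, y, groups=groups)
print(search.best_params_)
```

Holding out whole tiles, rather than random pixels, keeps pixels from the same scene out of both the training and the validation fold, so the score better reflects performance on unseen locations.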
You can then test the scripts by setting root_path to the location where the repo is saved locally.
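As a small sketch of that last step, the paths can be derived from a single root with `pathlib`. The repository location below is hypothetical, and whether the scripts expect `root_path` as a string or as a `Path` object is an assumption to check against the script you are running.

```python
from pathlib import Path

# Hypothetical location of the cloned repository; replace with your own path
root_path = Path.home() / "s2-pixel-detection"

# Folders named in this README, built relative to the root
data_dir = root_path / "examples" / "data"
scripts_dir = root_path / "examples" / "scripts"

if not data_dir.is_dir():
    print(f"Data folder not found at {data_dir}; download the dataset first.")
```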