-
Install docker. Install git if you're under windows as well.
You may need to configure Docker to let it use all available system memory if you're running this under Mac or Windows.
On Mac:
On Windows (click on the whale icon on the windows tray icons area):
On Windows, you will need to explicitly configure volume sharing across your host and the container:
Run the following commands to set the docker image up and train the model. Use a terminal in either Linux or Mac and Windows Power Shell on Windows.
-
Clone the repository, cd into the directory and build the docker image. Be sure you have write rights on the directory (ex: avoid clonning to c:\WINDOWS\system32):
git clone https://github.com/chembl/mmv_train_image.git cd mmv_train_image docker build -t chemblgroup/mmv_train .
-
In case you experience any kind of difficulties building the image behind a proxy, you can skip the 'docker build' command and instead of it run:
docker pull chemblgroup/mmv_train
-
Training dataset have to be named training_set.csv. It must copied in model_train folder and formatted like model_train/training_set_sample.csv.
-
Configure a model by editing the configuration file(model_train/model_configs.json) if needed. Available descriptors are:
- Fingerprints: Select between fcfp or ecfp. Set the radius.
- Physicochemical: alogp, mw, hba, hbd, rtb, n_h_atoms
-
Run the Docker container using the previously generated image. The container will map your host's model_train directory to the container's /model_train directory allowing the collection of all output files.
docker run -v /full/path/to/model_train:/model_train chemblgroup/mmv_train ex mac: docker run -v /Users/efelix/projects/mmv_train_image/model_train:/model_train chemblgroup/mmv_train ex linux: docker run -v /home/efelix/projects/mmv_train_image/model_train:/model_train chemblgroup/mmv_train ex windows: docker run -v C:\Users\efelix\projects\mmv_train_image\model_train:/model_train chemblgroup/mmv_train
The container will generate 5 files(description) for each model in model_train/outputs folder:
- modelX.json: the dump of the model.
- coverage_values_modelX.json: json file with a list of coverage values for each molecule in eMolecules dataset. Boxplots like the ones from the original publication will be generated with them.
- eMolecules_predictions_modelX.csv: eMolecules dataset predictions.
- internal_validation_report_modelX.json: classification metrics with training data using a 5 k-fold cross-validation.
- external_validation_report_modelX.json: classification metrics report with eMolecules predictions.
Send the required files to the EBI.