Active learning for site-of-metabolism (SoM) prediction, demonstrated with the Zaretzki dataset. The original data set was modified so that the molecular structures (preprocessed using RDKit and the ChEMBL Structure Pipeline) and all annotated sites of metabolism are stored in one .sdf file (data/zaretzki_preprocessed.sdf).
- Code to calculate CDPKit FAME descriptors.
- src/features/cdpkit_calculate_fame_descriptors.py
- Example:
python3 src/features/cdpkit_calculate_fame_descriptors.py -i data/zaretzki_preprocessed.sdf -o output/ -r 5 -m
- Code to run active learning with a random forest classifier.
- src/models/AL_for_SoM_pred.py
- Examples:
# split the data into 5 folds
python3 src/models/splits_for_AL.py
# active learning with 5-fold cross-validation
python3 src/models/AL_for_SoM_pred.py -i output/zaretzki_r5_5folds_random_split.csv -o output/active_learning/01.random_sampling_vs_AL/ -ct 0.3 -af -n 5
# random selection with 5-fold cross-validation
python3 src/models/AL_for_SoM_pred.py -i output/zaretzki_r5_5folds_random_split.csv -o output/active_learning/01.random_sampling_vs_AL/ -ct 0.3 -af -n 5 -rs
# repeat 5 times on one validation set
python3 src/models/AL_for_SoM_pred.py -i output/zaretzki_r5_5folds_random_split.csv -o output/active_learning/01.random_sampling_vs_AL/ -ct 0.3 -tf 1 -n 5 -rfs
- Visualize results
- notebooks/zaretzki_active_learning_result.ipynb
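
The examples above compare active learning against random selection. As a rough illustration of how the two selection strategies can differ, the sketch below picks the next atoms to label from a pool of unlabeled atoms. It assumes uncertainty sampling (choosing the atoms whose predicted probability of being a site of metabolism is closest to 0.5), which may not be the acquisition strategy used in src/models/AL_for_SoM_pred.py. The function names and the probability values are hypothetical and are not taken from this repository.

```python
import random


def select_most_uncertain(probabilities, batch_size):
    """Return indices of the pool atoms whose predicted probability is closest to 0.5.

    Assumption: uncertainty sampling; the repository's script may select differently.
    """
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:batch_size]


def select_random(pool_size, batch_size, seed=0):
    """Return a random batch of pool indices (the random-selection baseline)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(pool_size), min(batch_size, pool_size)))


# Hypothetical predicted probabilities for six unlabeled atoms.
pool_probabilities = [0.02, 0.48, 0.91, 0.55, 0.10, 0.35]

print(select_most_uncertain(pool_probabilities, 2))  # [1, 3]
print(select_random(len(pool_probabilities), 2))
```

In a full loop, the selected atoms would be labeled, moved from the pool into the training set, and the classifier retrained before the next round of selection.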