Global AI Challenge solution

Overall pipeline:
- data preprocessing (removed unneeded parts of molecules)
- generated Morgan, MACCS and EState fingerprints
- applied the MolCLR graph neural network
- applied RandomForest to the features described before
- the models' results were merged and averaged
- the results from the previous point were also passed to the Lipinski rule checker
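The merge-and-average step of the pipeline can be sketched roughly as below. This is a minimal illustration, assuming each model emits one score per test molecule in the same order; the helper name `average_predictions` and the equal weighting of the models are assumptions, not taken from the repository.

```python
import numpy as np


def average_predictions(*model_preds):
    """Average per-molecule scores from several models (hypothetical helper).

    Every argument is a sequence of scores, one per molecule, in the same order.
    np.stack raises ValueError if the models disagree on the number of molecules.
    """
    stacked = np.stack([np.asarray(p, dtype=float).ravel() for p in model_preds])
    return stacked.mean(axis=0)


# Toy scores standing in for the RandomForest and MolCLR outputs.
rf_scores = [0.2, 0.8, 0.5]
molclr_scores = [0.4, 0.6, 0.9]
ensemble = average_predictions(rf_scores, molclr_scores)
print(ensemble)
```

With real outputs, the toy lists would be replaced by the arrays saved by each model, for example one loaded with `np.load`.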
Repository structure:
- notebooks - all the notebooks used during the analysis
- data - all the data we used
- MolCLR - the MolCLR model
- YouGraphRF - the random forest model
How to run:
- Run `data_preprocessing.ipynb` to produce canonical SMILES.
- Run `ogb-rdk-transform.ipynb` to get the preprocessed dataset.
- Go to `YouGraphRF` and run `python random_forest.py --smiles_file ... --smiles_test_file ...`
- Take the predictions from `rf_preds/rf_final_pred.npy`.
- Go to `MolCLR`.
- Place the preprocessed molecule data in `data/covid/COVID.csv` and `data/covid/COVID-test.csv` for the train and test subsets, respectively.
- Run `python finetune_contrast.py`.
- Finally, run `predict-molclr.ipynb`. Change the model path to point to your checkpoint, or use the checkpoint from the submission, which is in the `finetune` folder.
- Pass the final predictions to `lipinski_rule_application.ipynb`.
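For readers unfamiliar with the Lipinski check in the last step, here is a minimal sketch of the standard rule of five applied to precomputed descriptors. It is not the notebook's code: the `Descriptors` container, the function names, and the tolerance of one violation are assumptions made for illustration, and the descriptor values in the example are made up.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed molecular descriptors (hypothetical container)."""
    mol_weight: float   # molecular weight in daltons
    logp: float         # octanol-water partition coefficient
    h_donors: int       # number of hydrogen-bond donors
    h_acceptors: int    # number of hydrogen-bond acceptors


def lipinski_violations(d):
    """Count how many of the four rule-of-five limits are exceeded."""
    checks = [
        d.mol_weight > 500,
        d.logp > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ]
    return sum(checks)


def passes_lipinski(d, max_violations=1):
    """Assumed policy: accept a molecule with at most `max_violations` violations."""
    return lipinski_violations(d) <= max_violations


# Illustrative descriptor values, not real measurements.
small = Descriptors(mol_weight=180.0, logp=1.2, h_donors=1, h_acceptors=4)
bulky = Descriptors(mol_weight=900.0, logp=7.0, h_donors=6, h_acceptors=12)
print(passes_lipinski(small), passes_lipinski(bulky))
```

In practice the descriptors would be computed from each SMILES string with a cheminformatics toolkit such as RDKit, and how the result is combined with the model scores depends on the notebook.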
The requirements are listed in `requirements.txt`.