MUBD-DecoyMaker^syn: Making Synthetic Maximal Unbiased Benchmarking Datasets via Deep Reinforcement Learning

Introduction

MUBD-DecoyMaker^syn is a new computational tool for making synthetic Maximal Unbiased Benchmarking Datasets (MUBD^syn) for in silico screening. Compared with our two earlier versions, i.e. MUBD-DECOYMAKER (the Pipeline Pilot-based version, or MUBD-DecoyMaker 1.0) and MUBD-DecoyMaker 2.0, MUBD-DecoyMaker^syn has two noteworthy features:

Virtual molecules generated by a recurrent neural network (RNN)-based molecular generator with reinforcement learning (RL), instead of chemical library molecules, constitute the unbiased decoy set (UDS) component of MUBD.
The criteria (or rules) for an ideal decoy defined in the earlier versions are integrated into a new scoring function that RL uses to fine-tune the generator.

Below is how to make MUBD^syn with MUBD-DecoyMaker^syn.

Requirements

Because REINVENT is used to generate the virtual decoys of MUBD^syn, users must install it as instructed (this repository holds version 3.2 of REINVENT, Copyright 2021 Atanas Patronov, licensed under the Apache 2.0 license). The corresponding conda environment, named reinvent, is used for virtual decoy generation. Please note that we also modified and use the source code of reinvent-chemistry and reinvent-scoring (Copyright 2021 Atanas Patronov, licensed under the Apache 2.0 license) in this repository in order to include our scoring functions specific to MUBD^syn. A second conda environment, named MUBD, is used for preprocessing and postprocessing.

Clone this repository and navigate to it:

$ git clone https://github.com/Sooooooap/MUBDsyn.git
$ cd MUBDsyn

Create the conda environment called reinvent:

$ cd Reinvent
$ conda env create -f reinvent.yml

Create the conda environment called MUBD:

$ cd ../MUBD
$ conda env create -f MUBD.yml

Usage

PR_agonist from NRLiSt BDB is used as a test case to demonstrate how to make MUBD^syn with MUBD-DecoyMaker^syn. All the test files are in the directory of ./resources/case.

Build the unbiased ligand set (ULS^syn)

Run build_uls.py to process the raw ligand set. This script takes the raw ligands in SMILES representation as input and outputs the unbiased ligand set (Diverse_ligands.csv). Four more files describing ligand properties, i.e. Diverse_ligands_PS.csv, Diverse_ligands_PS_maxmin.csv, Diverse_ligands_sims_maxmin.txt and Diverse_ligands_len.txt, are also generated. All outputs are stored in the directory ./MUBD/output/ULS/.

IMPORTANT: Ligand curation, including molecule standardization, salt removal and protonation within a specific pH range (implemented by Dimorphite-DL), is required if the ligands have not been curated. For ligand curation, we provide the --curate option for build_uls.py.

$ cd ./MUBD
$ conda activate MUBD
(MUBD) $ python build_uls.py --i ../resources/case/PR_agonists.csv --curate
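
After build_uls.py finishes, it can be useful to confirm that the five files named above were written. The following is a minimal sketch of such a check, not part of the repository: the helper name missing_uls_outputs is hypothetical, and the relative path output/ULS assumes the script is run from the ./MUBD directory.

```python
from pathlib import Path

# File names as listed in this README for the ULS step.
EXPECTED_ULS_FILES = (
    "Diverse_ligands.csv",
    "Diverse_ligands_PS.csv",
    "Diverse_ligands_PS_maxmin.csv",
    "Diverse_ligands_sims_maxmin.txt",
    "Diverse_ligands_len.txt",
)


def missing_uls_outputs(uls_dir):
    """Return the expected ULS files that are absent or empty in uls_dir."""
    uls_dir = Path(uls_dir)
    missing = []
    for name in EXPECTED_ULS_FILES:
        path = uls_dir / name
        # Treat a zero-byte file as missing, since it carries no ligands.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumption: the current working directory is ./MUBD.
    problems = missing_uls_outputs("output/ULS")
    if problems:
        print(f"Missing or empty: {problems}")
    else:
        print("All ULS files present")
```

An empty list means every expected file exists and is non-empty; it says nothing about whether the contents are correct.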

Generate the potential decoy set

mk_config.py writes out the configurations for generating the MUBD^syn virtual decoys. We provide gen_decoys.sh to iterate over all the ligands and set the configuration specific to each of them. Note that this step may take more than ten hours for this case with the default configuration.

$ chmod +x ./gen_decoys.sh
$ conda activate reinvent
(reinvent) $ ./gen_decoys.sh

Build the unbiased decoy set (UDS^syn)

The file ./MUBD/output/UDS/auto_train/ligand_*/results/scaffold_memory.csv contains the potential decoy set specific to ligand_*. The potential decoy set is refined by SMILES curation and structural clustering (script: curating_clustering.py). The unbiased decoys for each ligand are then annotated with their properties and merged (script: merge_decoys.py) to constitute the final decoy set. We provide build_uds.sh to run these scripts automatically.

$ chmod +x ./build_uds.sh
$ conda activate MUBD
(MUBD) $ ./build_uds.sh
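
If a ligand ends up with few or no decoys, one place to look is its potential decoy set. The sketch below counts the rows of each ligand_*/results/scaffold_memory.csv under the auto_train directory described above. It is an illustration only: the function name count_potential_decoys is hypothetical, and it assumes each CSV starts with a single header row.

```python
import csv
from pathlib import Path


def count_potential_decoys(auto_train_dir):
    """Map each ligand_* directory name to its number of potential decoys.

    A ligand whose scaffold_memory.csv is missing is mapped to None.
    """
    counts = {}
    for ligand_dir in sorted(Path(auto_train_dir).glob("ligand_*")):
        if not ligand_dir.is_dir():
            continue
        memory = ligand_dir / "results" / "scaffold_memory.csv"
        if not memory.is_file():
            counts[ligand_dir.name] = None
            continue
        with memory.open(newline="") as handle:
            rows = list(csv.reader(handle))
        # Assumption: the first row is a header, so it is not counted.
        counts[ligand_dir.name] = max(len(rows) - 1, 0)
    return counts


if __name__ == "__main__":
    # Assumption: the current working directory is ./MUBD.
    for ligand, n in count_potential_decoys("output/UDS/auto_train").items():
        print(ligand, "no scaffold_memory.csv" if n is None else n)
```

A value of None suggests that generation did not complete for that ligand, whereas a small count may indicate that few generated molecules passed the scoring function.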

The resulting MUBD^syn is stored in the root directory as MUBDsyn_ULS.csv and MUBDsyn_UDS.csv.

Validation

Users can perform a quick validation on the generated MUBD^syn with four basic metrics. We provide ./MUBD/validate.py to perform the validation and store the results in the directory of ./MUBD/validation/results/:

$ cd ./MUBD
$ conda activate MUBD
(MUBD) $ python validate.py

The comprehensive validations performed in the paper can be reproduced with the scripts and notebooks provided in the directory ./resources/validation_paper/. All relevant datasets are available at Zenodo.

Validation	Notebooks/Scripts	Datasets
Internal Validation	int_val_figs.ipynb int_val_tabs.ipynb	datasets_int_val
External Validation (classical_VS)	ext_val_classical_VS_figs.ipynb ext_val_classical_VS_tabs.ipynb ext_val_SI_classical_VS_figs.ipynb ext_val_SI_classical_VS_tabs.ipynb	datasets_ext_val_classical_VS datasets_ext_val_SI_classical_VS
External Validation (ML_VS*)	ext_val_ML_VS_AVEbias.py ext_val_ML_VS_AVEbias_plt_MUBDreal.ipynb ext_val_ML_VS_AVEbias_plt_MUBDsyn.ipynb	datasets_ext_val_ML_VS

*Benchmark results of three ML models are available at ext_val_ML_VS_benchmark.

Acknowledgements

We thank the authors of REINVENT (REINVENT 2.0: An AI Tool for De Novo Drug Design) for making REINVENT open to the community. Our work is based on this computational tool. Please consider citing their work if you use MUBD-DecoyMaker^syn in your research.

We also thank the developers of Dimorphite-DL, which we use to protonate raw actives. If you use it in your work, we highly recommend citing their publication, Dimorphite-DL: an open-source program for enumerating the ionization states of drug-like small molecules.

Citation

If you use MUBD^syn or related materials, please cite:

Shen, T.; Li, S.; Wang, X. S.; Wang, D.; Wu, S.; Xia, J.; Zhang, L., Deep Reinforcement Learning Enables Better Bias Control in Benchmark for Virtual Screening. Comput. Biol. Med. 2024, 108165.

or, in BibTeX:

@article{SHEN2024108165,
title = {Deep Reinforcement Learning Enables Better Bias Control in Benchmark for Virtual Screening},
journal = {Computers in Biology and Medicine},
pages = {108165},
year = {2024},
issn = {0010-4825},
doi = {https://doi.org/10.1016/j.compbiomed.2024.108165},
url = {https://www.sciencedirect.com/science/article/pii/S001048252400249X},
author = {Tao Shen and Shan Li and Xiang Simon Wang and Dongmei Wang and Song Wu and Jie Xia and Liangren Zhang}
}
