ChemTS-torch is a PyTorch implementation based on previous ChemTS works, including ChemTSv2[^1] (https://github.com/molecule-generator-collection/ChemTSv2), ChemTS[^2] (https://github.com/tsudalab/ChemTS) and MPChemTS[^3] (https://github.com/yoshizoe/mp-chemts).
New features:
- Generative models implemented in PyTorch;
- Multi-GPU support via PyTorch Lightning;
- Both SMILES[^4] and SELFIES[^5] available for string-based molecule generation;
- Transformer available as the generative model for better generation quality.
```bash
cd <YOUR PATH>
git clone https://github.com/tsudalab/ChemTS-torch
cd ChemTS-torch
pip install -r requirements.txt
export PYTHONPATH=<YOUR PATH>/ChemTS-torch
```
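Once installed, a quick import check confirms the environment is ready. This is a minimal sanity check, assuming the requirements install PyTorch and PyTorch Lightning (per the features listed above); it is not part of ChemTS-torch itself:

```python
# Quick sanity check of the installed environment (not part of ChemTS-torch).
import torch
import pytorch_lightning as pl  # newer Lightning releases also expose `import lightning`

print("torch:", torch.__version__)
print("lightning:", pl.__version__)
print("CUDA available:", torch.cuda.is_available())  # required for GPU training
```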
Two pretrained models are provided in the `pretrained/` folder. They are trained on the 250k ZINC dataset with SMILES and SELFIES, respectively.
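If you want to peek inside one of these checkpoints, a standard PyTorch Lightning checkpoint can be inspected as below. This is a sketch: the actual `.ckpt` filename under `pretrained/smiles_zinc250k/` depends on how the model was saved, so adjust the path accordingly.

```python
# Sketch: inspect a Lightning checkpoint; the filename below is hypothetical.
# On torch >= 2.6 you may need to pass weights_only=False to torch.load.
import torch

ckpt = torch.load("pretrained/smiles_zinc250k/model.ckpt", map_location="cpu")
print(ckpt.keys())  # Lightning checkpoints typically contain 'state_dict', 'hyper_parameters', ...
for name, tensor in list(ckpt["state_dict"].items())[:5]:
    print(name, tuple(tensor.shape))
```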
You can also follow these steps to train a custom RNN/Transformer molecule generative model:
- Prepare a molecule data file in SMILES format. If you want to train a SELFIES-based model, set it in the config file and the data will be converted automatically during training (see the sketch after this list).
- Prepare a configuration file to set parameters for training the model.
- Run the commands:
```bash
cd train_model
python train_RNN.py --config model_setting.yaml
```
- The checkpoint file of the model with the highest validation accuracy will be saved to the path set in the configuration file. This checkpoint is used to reload the trained model for the molecule generation process.
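The SMILES-to-SELFIES conversion mentioned in the first step can be reproduced by hand with the `selfies` package. This is a minimal sketch; ChemTS-torch performs the equivalent conversion automatically when `format: selfies` is set:

```python
# Minimal sketch of the SMILES-to-SELFIES conversion that happens automatically
# during training when `format: selfies` is set.
import selfies as sf

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, as an illustrative input
selfies_str = sf.encoder(smiles)  # SMILES -> SELFIES
print(selfies_str)
print(sf.decoder(selfies_str))    # round-trip back to SMILES
```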
Here is an example configuration file, `model_setting.yaml`:
```yaml
Data:
  dataset: ../data/250k_rndm_zinc_drugs_clean.smi # path to the SMILES file
  format: smiles # string-based molecule representation: smiles or selfies
  output_model_dir: pretrained/smiles_zinc250k # directory to save model checkpoints
  output_token: pretrained/smiles_zinc250k/smiles_tokens.txt # path to save tokens
  seq_len: 73 # maximum length of the token sequences, automatically calculated
  vocab_len: 65 # size of the token vocabulary, automatically calculated
Model:
  type: transformer # which generative model to use: rnn or transformer
  dropout_rate: 0.1 # dropout rate
  hidden_dim: 512 # number of hidden features
  ### if rnn is used: ###
  n_layer: 2 # number of recurrent layers
  ### if transformer is used: ###
  embed_dim: 128 # number of embedding features
  n_heads: 8 # number of attention heads
  n_layer: 6 # number of transformer blocks
Seed: 123 # random seed
Train:
  accelerator: gpu # cpu or gpu for training
  batch_size: 512 # batch size of the data
  decay_alpha: 0.01 # decay rate of the optimizer scheduler
  decay_steps: 100 # decay steps of the optimizer scheduler
  device: 3 # which GPU to use, for example, cuda:3
  epoch: 1000 # training epochs
  gradient_clip: 2.0 # value for gradient clipping
  learning_rate: 0.001 # learning rate
  num_workers: 12 # number of workers for the dataloader
  optimizer: adam # optimizer, adam or adamw
  patience: 50 # patience for early stopping
  scheduler: CosineAnnealingLR # optimizer scheduler
  validation_split: 0.1 # data split ratio for validation
```
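For reference, the file can be loaded with PyYAML exactly as a training script would read it. This is a minimal sketch: the key names follow the example above, and a real file should keep only the `n_layer` entry matching the chosen model type, since a YAML parser retains just one value per duplicate key:

```python
# Minimal sketch: load the training configuration with PyYAML and read a few fields.
# `model_setting.yaml` is the path used in the training command above.
import yaml

with open("model_setting.yaml") as f:
    conf = yaml.safe_load(f)

print(conf["Model"]["type"])           # 'rnn' or 'transformer'
print(conf["Train"]["learning_rate"])  # 0.001
print(conf["Data"]["format"])          # 'smiles' or 'selfies'
```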
Run the following command to generate molecules:
```bash
python run.py --config config/setting.yaml
```
For the reward function setting, please check the information in the ChemTSv2 repository.
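As an illustration, a custom reward following the pattern of ChemTSv2's bundled rewards looks roughly like the sketch below. It assumes the `Reward` base class interface from that repository, and the logP objective via RDKit is just an example:

```python
# Sketch of a custom reward in the ChemTSv2 style (assumes the Reward base class
# with its two static methods, as defined in the ChemTSv2 repository).
import numpy as np
from rdkit.Chem import Descriptors

from reward.reward import Reward  # import path per the ChemTSv2 repo layout


class LogPReward(Reward):
    @staticmethod
    def get_objective_functions(conf):
        # Each objective takes an RDKit Mol and returns a scalar.
        def logp(mol):
            return Descriptors.MolLogP(mol)
        return [logp]

    @staticmethod
    def calc_reward_from_objective_values(values, conf):
        # Squash the raw objective into (-1, 1) for the tree search.
        return np.tanh(values[0] / 10)
```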
The example configuration file for generation is given in `config/setting.yaml`. The path to the checkpoint produced by training has to be provided there. Options and functions are kept unchanged from the ChemTSv2 work and can be checked in that repository.
Footnotes

[^1]: Ishida, S., Aasawat, T., Sumita, M., Katouda, M., Yoshizawa, T., Yoshizoe, K., Tsuda, K. and Terayama, K., 2023. ChemTSv2: Functional molecular design using de novo molecule generator. Wiley Interdisciplinary Reviews: Computational Molecular Science, p.e1680.
[^2]: Yang, X., Zhang, J., Yoshizoe, K., Terayama, K. and Tsuda, K., 2017. ChemTS: an efficient python library for de novo molecular generation. Science and Technology of Advanced Materials, 18(1), pp.972-976.
[^3]: Yang, X., Aasawat, T.K. and Yoshizoe, K., 2020. Practical massively parallel Monte-Carlo tree search applied to molecular design. arXiv preprint arXiv:2006.10504.
[^4]: Weininger, D., 1988. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1), pp.31-36.
[^5]: Krenn, M., Häse, F., Nigam, A., Friederich, P. and Aspuru-Guzik, A., 2020. Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. Machine Learning: Science and Technology, 1(4), p.045024.