De novo design generates virtual molecules from scratch. The generated structures are filtered with several scoring functions and assessed for synthetic feasibility to remove reactive and unrealistic compounds. De novo design based on generative algorithms, such as deep learning, relies on neural networks and on databases with many compounds. Public databases such as the COlleCtion of Open Natural ProdUcTs (COCONUT) and the Universal Natural Product Database (UNPD) are rich sources of structures for generative models.
This repository contains the supplementary information for the original paper (Natural products subsets: Generation and characterization), which includes: the MaxMin algorithm and the structural diversity analysis implemented in Python; the interactive TMAPs; the new REAL-Enamine subset; and the three subsets generated from UNPD with 14,994, 7,497, and 4,998 compounds, including stereochemical information.
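To illustrate the idea behind MaxMin diversity selection, here is a minimal sketch in plain Python. It is not the implementation shipped in this repository. It assumes that each compound is represented by a fingerprint stored as a set of on-bit indices, and that dissimilarity is measured as 1 minus the Tanimoto similarity. The names `tanimoto`, `maxmin_pick`, and `example_fps` are hypothetical and used only for illustration.

```python
from typing import FrozenSet, List, Sequence

# Assumption: a fingerprint is the set of indices of its "on" bits.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity between two on-bit sets."""
    union = len(a | b)
    if union == 0:
        return 1.0  # two empty fingerprints are treated as identical
    return len(a & b) / union


def maxmin_pick(fps: Sequence[Fingerprint], n_pick: int, seed_index: int = 0) -> List[int]:
    """Greedy MaxMin selection: repeatedly add the compound whose distance
    to its nearest already-picked compound is the largest."""
    if not 0 < n_pick <= len(fps):
        raise ValueError("n_pick must be between 1 and len(fps)")

    picked = [seed_index]
    # min_dist[i]: distance from compound i to its nearest picked compound.
    min_dist = [1.0 - tanimoto(fp, fps[seed_index]) for fp in fps]
    min_dist[seed_index] = -1.0  # negative value marks "already picked"

    while len(picked) < n_pick:
        # Candidate that is farthest from the current selection.
        nxt = max(range(len(fps)), key=min_dist.__getitem__)
        picked.append(nxt)
        min_dist[nxt] = -1.0
        # Update nearest-picked distances for the remaining candidates.
        for i, fp in enumerate(fps):
            if min_dist[i] >= 0.0:
                d = 1.0 - tanimoto(fp, fps[nxt])
                if d < min_dist[i]:
                    min_dist[i] = d
    return picked


# Toy fingerprints (made-up data, not taken from UNPD).
example_fps = [
    frozenset({1, 2, 3}),
    frozenset({1, 2, 4}),
    frozenset({7, 8, 9}),
    frozenset({1, 2, 3, 4}),
    frozenset({7, 8, 10}),
]
picked = maxmin_pick(example_fps, n_pick=3)
print(picked)
```

The result depends on the seed compound and on how ties are broken, so different implementations can return different but similarly diverse subsets. For real molecules, a cheminformatics toolkit would supply the fingerprints; that step is outside the scope of this sketch.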
A significant perspective of this work is that the natural product subsets derived from the UNPD can be used to develop generative models that use deep learning algorithms and require highly diverse compounds, for example in de novo design. The natural product subsets can also be used to develop predictive models, for virtual screening, and as reference databases for evaluating structural diversity or similarity to a specific subset, among other applications.
Please cite our paper:
Chávez-Hernández, A. L., & Medina-Franco, J. L. (2023). Natural products subsets: Generation and characterization. Artificial Intelligence in the Life Sciences, 3, 100066. https://doi.org/10.1016/j.ailsci.2023.100066