diff --git a/.gitignore b/.gitignore index 6996e69..5d2fa4a 100644 --- a/.gitignore +++ b/.gitignore @@ -138,6 +138,6 @@ tests/data/ # setuptools_scm version scikit_mol/_version.py -notebooks/sandbox.py .vscode notebooks/SLC6A4_active_excape_export.csv +sandbox/ diff --git a/CONTRIBUTION.md b/CONTRIBUTION.md index 3e39632..1b9aa61 100644 --- a/CONTRIBUTION.md +++ b/CONTRIBUTION.md @@ -5,7 +5,7 @@ Thanks for your interest in contributing to the project. Please read on in the s ## Slack channel We have a slack channel for communication, ask for an invite: esbenbjerrum+scikit_mol@gmail.com -It's not really active and Slack wan't to be paid now. Maybe we can use Discord instead. +It's not really active and Slack wan't to be paid now. Maybe we can use Discord instead as slack is now deleting old threads. ## Installation @@ -22,12 +22,13 @@ The projects transformers subclasses the BaseEstimator and Transformer mixin cla - The arguments accepted by **init** should all be keyword arguments with a default value. - Every keyword argument accepted by **init** should correspond to an attribute on the instance. -- - There should be no logic, not even input validation, and the parameters should not be changed. +- - There should be no logic, not even input validation, and the parameters should not be changed inside the **init** function. Scikit-learn classes depends on this in order to for e.g. the .get_params(), .set_params(), cloning abilities and representation rendering to work. +- With the new error handling, falsy objects need to return masked arrays or arrays with np.nan (for float dtype) ### Tips -- We have observed that some external tools used "exotic" types such at np.int64 when doing hyperparameter tuning. It is thus necessary to cast to standard types before making calls to rdkit functions. This behaviour is tested in the test_parameter_types test +- We have observed that some external tools used "exotic" types such at np.int64 when doing hyperparameter tuning. 
It is thus necessary do defensive programming to cast parameters to standard types before making calls to rdkit functions. This behaviour is tested in the test_parameter_types test - @property getters and setters can be used if additional logic are needed when setting the attributes from the keywords while at the same time adhering to the sklearn requisites. @@ -48,6 +49,7 @@ parameters and output of methods should preferably be using typehints ## Testing New transformer classes should be added to the pytest tests in the tests directory. A lot of tests are made general, and tests aspects of the transformers that are needed for sklearn compliance or other features. The transformer is then added to a fixture and can be added to the lists of transformer objects that are run by these test. Specific tests may also be necessary to set up. As exampe the assert_transformer_set_params needs a list of non-default parameters in order to set the set_params functionality of the object. +Scikit-Learn has a check_estimator that we should strive to get to work, some classes of scikit-mol currently does not pass all tests. ## Notebooks diff --git a/README.md b/README.md index d59717d..7c6255f 100644 --- a/README.md +++ b/README.md @@ -91,7 +91,12 @@ There are a collection of notebooks in the notebooks directory which demonstrate We also put a software note on ChemRxiv. [https://doi.org/10.26434/chemrxiv-2023-fzqwd](https://doi.org/10.26434/chemrxiv-2023-fzqwd) -## Contributing +## Roadmap and Contributing + +_Help wanted!_ Are you a PhD student that want a "side-quest" to procrastinate your thesis writing or are you simply interested in computational chemistry, cheminformatics or simply with an interest in QSAR modelling, Python Programming open-source software? Do you want to learn more about machine learning with Scikit-Learn? Or do you use scikit-mol for your current work and would like to pay a little back to the project and see it improved as well? 
+With a little bit of help, this project can be improved much faster! Reach to me (Esben), for a discussion about how we can proceed. + +Currently we are working on fixing some deprecation warnings, its not the most exciting work, but it's important to maintain a little. Later on we need to go over the scikit-learn compatibility and update to some of their newer features on their estimator classes. We're also brewing on some feature enhancements and tests, such as new fingerprints and a more versatile standardizer. There are more information about how to contribute to the project in [CONTRIBUTION.md](https://github.com/EBjerrum/scikit-mol/CONTRIBUTION.md) diff --git a/notebooks/01_basic_usage.ipynb b/notebooks/01_basic_usage.ipynb index 4c62abe..e254859 100644 --- a/notebooks/01_basic_usage.ipynb +++ b/notebooks/01_basic_usage.ipynb @@ -2,7 +2,7 @@ "cells": [ { "cell_type": "markdown", - "id": "8a3e313c", + "id": "aa079ac3", "metadata": {}, "source": [ "# Scikit-Mol\n", @@ -13,7 +13,7 @@ }, { "cell_type": "markdown", - "id": "7bcbed23", + "id": "76d24789", "metadata": {}, "source": [ "The transformer classes are easy to load, configure and use to process molecular information into vectorized formats using fingerprinters or collections of descriptors. For demonstration purposes, let's load a MorganTransformer, that can convert a list of RDKit molecular objects into a numpy array of morgan fingerprints. First create some molecules from SMILES strings." 
@@ -22,13 +22,13 @@ { "cell_type": "code", "execution_count": 1, - "id": "f8025236", + "id": "2c8cad03", "metadata": { "execution": { - "iopub.execute_input": "2024-04-12T12:10:09.802220Z", - "iopub.status.busy": "2024-04-12T12:10:09.802030Z", - "iopub.status.idle": "2024-04-12T12:10:09.808949Z", - "shell.execute_reply": "2024-04-12T12:10:09.808440Z" + "iopub.execute_input": "2024-11-24T09:27:16.292725Z", + "iopub.status.busy": "2024-11-24T09:27:16.292083Z", + "iopub.status.idle": "2024-11-24T09:27:16.306663Z", + "shell.execute_reply": "2024-11-24T09:27:16.304935Z" } }, "outputs": [], @@ -39,32 +39,34 @@ { "cell_type": "code", "execution_count": 2, - "id": "58a33f4d", + "id": "8d5b2333", "metadata": { "execution": { - "iopub.execute_input": "2024-04-12T12:10:09.811277Z", - "iopub.status.busy": "2024-04-12T12:10:09.811060Z", - "iopub.status.idle": "2024-04-12T12:10:09.936897Z", - "shell.execute_reply": "2024-04-12T12:10:09.936201Z" + "iopub.execute_input": "2024-11-24T09:27:16.313611Z", + "iopub.status.busy": "2024-11-24T09:27:16.313028Z", + "iopub.status.idle": "2024-11-24T09:27:16.510254Z", + "shell.execute_reply": "2024-11-24T09:27:16.509620Z" } }, "outputs": [], "source": [ "from rdkit import Chem\n", "\n", - "smiles_strings = [\"C12C([C@@H](OC(C=3C=CC(=CC3)F)C=4C=CC(=CC4)F)CC(N1CCCCCC5=CC=CC=C5)CC2)C(=O)OC\", \n", - "\"O(C1=NC=C2C(CN(CC2=C1)C)C3=CC=C(OC)C=C3)CCCN(CC)CC\",\n", - "\"O=S(=O)(N(CC=1C=CC2=CC=CC=C2C1)[C@@H]3CCNC3)C\",\n", - "\"C1(=C2C(CCCC2O)=NC=3C1=CC=CC3)NCC=4C=CC(=CC4)Cl\",\n", - "\"C1NC[C@@H](C1)[C@H](OC=2C=CC(=NC2C)OC)CC(C)C\",\n", - "\"FC(F)(F)C=1C(CN(C2CCNCC2)CC(CC)CC)=CC=CC1\"]\n", + "smiles_strings = [\n", + " \"C12C([C@@H](OC(C=3C=CC(=CC3)F)C=4C=CC(=CC4)F)CC(N1CCCCCC5=CC=CC=C5)CC2)C(=O)OC\",\n", + " \"O(C1=NC=C2C(CN(CC2=C1)C)C3=CC=C(OC)C=C3)CCCN(CC)CC\",\n", + " \"O=S(=O)(N(CC=1C=CC2=CC=CC=C2C1)[C@@H]3CCNC3)C\",\n", + " \"C1(=C2C(CCCC2O)=NC=3C1=CC=CC3)NCC=4C=CC(=CC4)Cl\",\n", + " \"C1NC[C@@H](C1)[C@H](OC=2C=CC(=NC2C)OC)CC(C)C\",\n", + " 
\"FC(F)(F)C=1C(CN(C2CCNCC2)CC(CC)CC)=CC=CC1\",\n", + "]\n", "\n", "mols = [Chem.MolFromSmiles(smiles) for smiles in smiles_strings]" ] }, { "cell_type": "markdown", - "id": "0228c878", + "id": "b9a588c7", "metadata": {}, "source": [ "Next we import the Morgan fingerprint transformer" @@ -73,13 +75,13 @@ { "cell_type": "code", "execution_count": 3, - "id": "cdb821a1", + "id": "0a625dda", "metadata": { "execution": { - "iopub.execute_input": "2024-04-12T12:10:09.939980Z", - "iopub.status.busy": "2024-04-12T12:10:09.939552Z", - "iopub.status.idle": "2024-04-12T12:10:10.505528Z", - "shell.execute_reply": "2024-04-12T12:10:10.504885Z" + "iopub.execute_input": "2024-11-24T09:27:16.513123Z", + "iopub.status.busy": "2024-11-24T09:27:16.512856Z", + "iopub.status.idle": "2024-11-24T09:27:17.089043Z", + "shell.execute_reply": "2024-11-24T09:27:17.088357Z" } }, "outputs": [ @@ -100,10 +102,10 @@ }, { "cell_type": "markdown", - "id": "e8ebae67", + "id": "355610d1", "metadata": {}, "source": [ - "It actually renders as a cute little interactive block in the Jupyter notebook and lists the options that are not the default values. If we print it, it also gives the information on the settings. \n", + "It actually renders as a cute little interactive block in the Jupyter notebook and lists the options that are not the default values. 
If we print it, it also gives the information on the settings.\n", "\n", "![An image of the interactive transformer widget](images/Transformer_Widget.jpg \"Transformer object rendering in Jupyter\")\n", "\n", @@ -113,20 +115,424 @@ { "cell_type": "code", "execution_count": 4, - "id": "c3a24f4e", + "id": "9a801d0f", "metadata": { "execution": { - "iopub.execute_input": "2024-04-12T12:10:10.508400Z", - "iopub.status.busy": "2024-04-12T12:10:10.508055Z", - "iopub.status.idle": "2024-04-12T12:10:10.514636Z", - "shell.execute_reply": "2024-04-12T12:10:10.514117Z" + "iopub.execute_input": "2024-11-24T09:27:17.091942Z", + "iopub.status.busy": "2024-11-24T09:27:17.091571Z", + "iopub.status.idle": "2024-11-24T09:27:17.098501Z", + "shell.execute_reply": "2024-11-24T09:27:17.097922Z" } }, "outputs": [ { "data": { "text/html": [ - "
MorganFingerprintTransformer(radius=3)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
MorganFingerprintTransformer(radius=3)