Updated README and cleaned up some files

EBjerrum · Oct 16, 2022 · ebb19e2
1 parent 0187a27
Showing 4 changed files with 60 additions and 112 deletions.
diff --git a/README.md b/README.md
@@ -1,42 +1,74 @@
 # scikit-mol
-scikit-learn classes for molecular vectorization using RDKit
 
+Scikit-Learn classes for molecular vectorization using RDKit
 
-TODO:
-    Expand number of fingerprint classes and featurizers
-        AtomPairs
-        TopologicalTorsions
-        RDKit
-        Descriptors
-        LINGOS
-        ...
+The intended use is to add molecular vectorization directly to scikit-learn pipelines, so that the final model can predict directly from RDKit molecules or SMILES strings.
 
-    Make dictionary based FP class
-        No Hashing, .fit() learns the keys of the dataset
+For example, with the needed scikit-learn and scikit-mol imports, and RDKit mol objects in the mol_list_train and mol_list_test lists:
 
-    Make a basic standardarizer transformer class
+    pipe = Pipeline([('mol_transformer', MorganTransformer()), ('Regressor', Ridge())])
+    pipe.fit(mol_list_train, y_train)
+    pipe.score(mol_list_test, y_test)
+    pipe.predict([Chem.MolFromSmiles('c1ccccc1C(=O)C')])
 
-    Make a SMILES to Mol transformer class
+    >>> array([4.93858815])
 
-Make Notebook with examples
-    Standalone usage
-    Inclusion in pipeline
-        Can transformers be used in parallel (e.g. to use both FP features and Descriptors at the same time?)
-    Hyperparameter optimization via native Scikit-Classes
-    Hyperparameter optimization via external optimizer e.g. https://scikit-optimize.github.io/stable/
+The scikit-learn compatibility should also make it easier to include the fingerprinting step in hyperparameter tuning with scikit-learn's utilities.
 
+The first draft of the project was created at the [RDKit UGM 2022 hackathon](https://github.com/rdkit/UGM_2022) on 14 October 2022.
 
-Make basic unit-tests
 
+## Implemented
+* Transformer Classes
+    * SmilesToMol
+    * Desc2DTransformer
+    * MACCSTransformer
+    * RDKitFPTransformer
+    * AtomPairFingerprintTransformer
+    * TopologicalTorsionFingerprintTransformer
+    * MorganTransformer
+<br>
+<br>
+* Utilities
+    * CheckSmilesSanitazion
+
+## Installation
+Users can install the latest tagged release with pip
+
+    pip install scikit-mol
+
+Bleeding edge
+
+    pip install git+https://github.com/EBjerrum/scikit-mol.git
+
+Developers 
 
-Installation
     git clone git@github.com:EBjerrum/scikit-mol.git
     pip install -e .
 
+## Documentation
+None yet, but the notebooks directory contains some # %% delimited example scripts with demonstrations
+
+## BUGS
+Probably still some
+
+
+## TODO
+* Make standardizer less 'chatty'
+* Unit test coverage of classes
+* Make further example notebooks
+    * Standalone usage (not in pipeline)
+    * Advanced pipelining
+    * Hyperparameter optimization via external optimizer e.g. https://scikit-optimize.github.io/stable/
+
+## Ideas
+* LINGOS transformer
 
 
-Contributers:
-    Esben Bjerrum, [email protected]
-    Son Ha, [email protected]
-    Oh-hyeon Choung, [email protected]
-    Please add yourself here, we'll properly markdown it later
+## Contributors:
+* Esben Bjerrum, [email protected]
+* Carmen Esposito, https://github.com/cespos
+* Son Ha, [email protected]
+* Oh-hyeon Choung, [email protected]
+* Andreas Poehlmann, https://github.com/ap--
+* Ya Chen, https://github.com/anya-chen
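
Note on the README text above: it says that scikit-learn compatibility lets the vectorization step take part in hyperparameter tuning. The following is a minimal sketch of that mechanism, not scikit-mol code. `ToyLingoTransformer` is a hypothetical stand-in featurizer that hashes SMILES substrings (loosely in the spirit of the LINGOS idea listed in the README), chosen so that the example runs with only scikit-learn and NumPy installed. The SMILES strings are real, but the target values are made up for illustration.

```python
import zlib

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class ToyLingoTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in featurizer: hashed counts of SMILES substrings."""

    def __init__(self, q=4, n_bits=64):
        self.q = q            # substring length
        self.n_bits = n_bits  # length of the output count vector

    def fit(self, X, y=None):
        # Hashing is stateless, so there is nothing to learn here.
        return self

    def transform(self, X):
        out = np.zeros((len(X), self.n_bits))
        for row, smiles in enumerate(X):
            # Slide a window of length q over the SMILES string;
            # strings shorter than q contribute a single substring.
            for i in range(max(len(smiles) - self.q + 1, 1)):
                lingo = smiles[i:i + self.q]
                # crc32 is deterministic across runs, unlike built-in hash().
                out[row, zlib.crc32(lingo.encode()) % self.n_bits] += 1
        return out


smiles_train = ["CCO", "CCCO", "CCCCO", "CCCCCO", "CCCCCCO", "CCCCCCCO"]
y_train = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # made-up target values

pipe = Pipeline([("lingo", ToyLingoTransformer()), ("ridge", Ridge())])

# "<step name>__<parameter>" addresses a parameter of a pipeline step,
# so featurizer and model settings are tuned in the same search.
param_grid = {"lingo__q": [2, 3], "ridge__alpha": [0.1, 1.0]}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(smiles_train, y_train)

predictions = search.predict(["CCCCO"])
```

With scikit-mol, the stand-in step would presumably be replaced by one of the transformers listed under "Implemented", and the grid keys would then name that transformer's own constructor parameters.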
diff --git a/notebooks/sandbox.py b/notebooks/sandbox.py
@@ -6,15 +6,15 @@
 
 
 #%%
-from scikit_mol.smilestomol import SmilesToMol
+from scikit_mol.transformers import SmilesToMol
 smiles_list = ['c1ccccc1'] * 10
 smilestomol = SmilesToMol()
 mols = smilestomol.fit_transform(smiles_list)
 mols[0]
 
 
 #%%
-from scikit_mol.smilestomol import SmilesToMol
+from scikit_mol.transformers import SmilesToMol
 smiles_list = ['c1ccccc1'] * 10
 y = list(range(10))
 y.append(1000)
@@ -42,31 +42,6 @@
 mols[0]
 
 
-
-#%%
-y_out = []
-X_out = []
-y_error = []
-X_error = []
-
-for smiles, y_value in zip(smiles_list, y):
-    mol = Chem.MolFromSmiles(smiles)
-    if mol:
-        X_out.append(mol)
-        y_out.append(y_value)
-    else:
-        print(f'Logging: Error in parsing {smiles}')
-        X_error.append(smiles)
-        y_error.append(y_value)
-
-print(X_out)
-print(y_out)
-print(X_error)
-print(y_error)
-
-
-
-
 #%%
 X= [Chem.MolFromSmiles('c1ccccc1')]*10
 t = MorganTransformer(useCounts=True)
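
Note on the cell removed from notebooks/sandbox.py above: it separated parsable SMILES from unparsable ones while keeping each target value aligned with its input. The same logic can be sketched as a reusable function. The name `split_parsable` and its `parse` argument are hypothetical and not part of scikit-mol. With RDKit installed, `parse` would be `Chem.MolFromSmiles`, which returns `None` when parsing fails. The sketch uses a toy parser so that it runs without RDKit.

```python
from typing import Any, Callable, List, Sequence, Tuple


def split_parsable(
    smiles_list: Sequence[str],
    y: Sequence[Any],
    parse: Callable[[str], Any],
) -> Tuple[List[Any], List[Any], List[str], List[Any]]:
    """Split inputs and targets by whether parse(smiles) returns None."""
    X_out, y_out, X_error, y_error = [], [], [], []
    for smiles, y_value in zip(smiles_list, y):
        mol = parse(smiles)
        if mol is not None:
            X_out.append(mol)
            y_out.append(y_value)
        else:
            # Keep the failing SMILES and its target for later inspection.
            X_error.append(smiles)
            y_error.append(y_value)
    return X_out, y_out, X_error, y_error


def toy_parse(smiles: str):
    # Stand-in parser for demonstration only: rejects strings containing "X".
    return None if "X" in smiles else ("mol", smiles)


X_ok, y_ok, X_bad, y_bad = split_parsable(
    ["c1ccccc1", "CCX"], [1.0, 2.0], toy_parse
)
```

Returning the rejected inputs instead of printing them leaves the choice of logging to the caller.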

diff --git a/scikit_mol/smilestomol.py b/scikit_mol/smilestomol.py
diff --git a/standardizer.py b/standardizer.py