Releases: swansonk14/chemfunc
TPSA
SDF to SMARTS and setup.py to pyproject.toml
SDF to SMARTS
Added the function sdf_to_smarts, which behaves just like sdf_to_smiles but converts an SDF file to SMARTS instead of SMILES. This required refactoring the code by creating a new convert_sdf function that is called by both sdf_to_smiles and sdf_to_smarts with different parameters.
setup.py to pyproject.toml
Refactored the code base to replace setup.py (using setuptools) with pyproject.toml (using hatchling).
PAINS filter
Added filters for PAINS and other unwanted substructures as the "pains_plus" property in chemfunc compute_properties.
MCS denominator options
The maximum common substructure (MCS) similarity calculation in molecular_similarities.py now has additional options for the denominator used in the MCS similarity calculation. The denominator can be specified with --denominator <denominator>, where <denominator> is one of the following three options.
mol_1: similarity = (MCS size) / (number of atoms in mol_1)
mol_2: similarity = (MCS size) / (number of atoms in mol_2)
avg: similarity = 0.5 * [(MCS size) / (number of atoms in mol_1) + (MCS size) / (number of atoms in mol_2)]
The previous definition was mol_2, so this is the default.
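The three denominator options can be illustrated with a minimal sketch. The function name mcs_similarity and its arguments are hypothetical and are not taken from molecular_similarities.py; the sketch assumes the MCS size and the atom counts have already been computed.

```python
def mcs_similarity(
    mcs_size: int,
    num_atoms_1: int,
    num_atoms_2: int,
    denominator: str = "mol_2",
) -> float:
    """Hypothetical helper: MCS similarity under each denominator option.

    mcs_size is the number of atoms in the maximum common substructure;
    num_atoms_1 and num_atoms_2 are the atom counts of mol_1 and mol_2.
    """
    if denominator == "mol_1":
        return mcs_size / num_atoms_1
    if denominator == "mol_2":
        return mcs_size / num_atoms_2
    if denominator == "avg":
        return 0.5 * (mcs_size / num_atoms_1 + mcs_size / num_atoms_2)
    raise ValueError(f"Unknown denominator: {denominator!r}")


# Made-up example: a 6-atom MCS shared by a 10-atom and a 12-atom molecule.
for option in ("mol_1", "mol_2", "avg"):
    print(option, mcs_similarity(6, 10, 12, denominator=option))
```

Note that mol_1 and mol_2 make the similarity asymmetric in the two molecules, while avg is symmetric.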
Improved MCS and Regression to Classification
MCS
The maximum common substructure (MCS) similarity function in molecular_similarities.py now accepts additional parameters for modifying the MCS calculation. Specifically, it now allows for match_valences, ring_matches_ring_only, and complete_rings_only (see https://www.rdkit.org/docs/source/rdkit.Chem.MCS.html). These are also accessible via the command line when running chemfunc nearest_neighbor.
Regression to Classification
The regression_to_classification.py script now includes a delete_class_indices flag to delete certain class indices. The primary use case is for building binary classification datasets with a gap between the active and inactive categories. For example, setting thresholds = [0.4, 0.6] and delete_class_indices = {1} will label data < 0.4 as 0 and data >= 0.6 as 1 (originally labeled 2) and will delete data in between 0.4 and 0.6 (originally labeled 1).
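The thresholding and deletion behavior described above can be sketched in a few lines of plain Python. This is an illustrative approximation, not the script's actual code: the function signature is hypothetical, and it assumes that a value exactly equal to a threshold falls into the higher class, which matches the "data >= 0.6" wording.

```python
from bisect import bisect_right


def regression_to_classification(values, thresholds, delete_class_indices=frozenset()):
    """Hypothetical sketch: bin values by thresholds, drop deleted classes, re-index the rest.

    Returns a list of (value, class_index) pairs for the values that are kept.
    """
    thresholds = sorted(thresholds)
    num_classes = len(thresholds) + 1

    # Map each surviving original class index to a new consecutive index.
    kept_classes = [c for c in range(num_classes) if c not in delete_class_indices]
    new_index = {old: new for new, old in enumerate(kept_classes)}

    labeled = []
    for value in values:
        # bisect_right counts the thresholds <= value, so a value equal to a
        # threshold is assigned to the higher class.
        original_class = bisect_right(thresholds, value)
        if original_class in delete_class_indices:
            continue  # Drop data that falls in a deleted class.
        labeled.append((value, new_index[original_class]))
    return labeled


# Made-up data: values in [0.4, 0.6) are dropped, the rest become binary labels.
data = [0.1, 0.4, 0.5, 0.6, 0.9]
result = regression_to_classification(data, thresholds=[0.4, 0.6], delete_class_indices={1})
print(result)
```

With these inputs, 0.1 keeps label 0, the values 0.4 and 0.5 are removed, and 0.6 and 0.9 are relabeled from 2 to 1.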
SDF to SMILES Properties
SDF to SMILES
This release primarily modifies the sdf_to_smiles function. Previously, the user had to specify which properties they wanted extracted from the SDF file (along with the SMILES) using the properties flag. That option still remains, but now the user can alternatively request that all properties be extracted from the SDF (with the all_properties flag). Additionally, the user can now specify the name of the column in the CSV file that will contain the SMILES using the smiles_column flag (previously it was hard-coded to "smiles").
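How the three flags interact can be sketched without RDKit by starting from records that have already been parsed. Everything here is an assumption for illustration: the build_rows helper, the (smiles, properties) record layout, the sample values, and the choice to fill a missing property with an empty string are not taken from sdf_to_smiles.

```python
import csv
import io


def build_rows(records, properties=None, all_properties=False, smiles_column="smiles"):
    """Hypothetical sketch: choose CSV columns from (smiles, props_dict) records.

    Returns (fieldnames, rows), ready to be written with csv.DictWriter.
    """
    if all_properties:
        # Collect every property name seen, preserving first-seen order.
        names = list(dict.fromkeys(name for _, props in records for name in props))
    else:
        names = list(properties or [])

    rows = []
    for smiles, props in records:
        row = {smiles_column: smiles}
        for name in names:
            # Assumption: a property missing from a molecule becomes an empty cell.
            row[name] = props.get(name, "")
        rows.append(row)
    return [smiles_column] + names, rows


# Made-up records standing in for molecules read from an SDF file.
records = [
    ("CCO", {"ID": "a", "activity": "1"}),
    ("c1ccccc1", {"ID": "b"}),
]

fieldnames, rows = build_rows(records, all_properties=True, smiles_column="SMILES")
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Passing properties=["ID"] instead of all_properties=True would keep only the SMILES column and the ID column.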
Morgan fingerprints
This release also changes the data type of Morgan fingerprints from bool to np.float32. This brings them in line with the RDKit fingerprints, which are already np.float32, and helps avoid issues when using Morgan fingerprints in ML models that expect float-type vectors.
Fixing t-SNE
This release fixes an issue in the plot_tsne.py script, where the TSNE object was initialized with a now-deprecated parameter called square_distances. This parameter has now been removed.
Save Fingerprints Script
Added a save_fingerprints script to compute fingerprints (RDKit or Morgan) from the SMILES in a CSV file and save them as an NPZ file. The computation is done in parallel for speed.
Also fixed a version issue between scipy and descriptastorus.
Minor Fixes
Fixed RDKit fingerprints with NumPy version >= 1.24.0 and fixed metrics for the nearest neighbor scripts.
Fixing SA Score Import
Fixed the SA score import.