Creation of a MolZipRegressor class for scikit-learn compatibility

Doctopya/molzip_adapted_class

 
 

Overview

This is an adaptation of the algorithm and code provided in https://github.com/daenuprobst/molzip.git.

The algorithm is also described in the paper: Jiang, Zhiying, et al. “Less is More: Parameter-Free Text Classification with Gzip.” arXiv preprint arXiv:2212.09410 (2022).

Two methods are provided: (1) a regression and (2) a PCA projection. Both use only the SMILES strings as input.
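
The compression-based distance behind both methods can be sketched as follows. This is a minimal illustration of the normalized compression distance (NCD) used in the cited gzip paper, not the code shipped in this repository. The helper names `compressed_len` and `ncd` are hypothetical, and joining the two strings with a space is an assumption taken from the paper's text setting.

```python
import gzip


def compressed_len(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of `text`."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two strings.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed length. Joining with a space is an assumption.
    """
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
caffeine = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"

# A molecule compared with itself should be closer than two different molecules.
print(ncd(aspirin, aspirin), ncd(aspirin, caffeine))
```

A smaller value means the two SMILES strings share more compressible structure. The regression and the projection both build on pairwise distances of this kind.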

Requirements

Tested with Python 3.9 using gzip, numpy and sklearn. pandas and datamol are also required.

Regression using gzip

Perform the regression using a simpler implementation of molzip.

from molzip_simple import regression
import pandas as pd
import matplotlib.pyplot as plt
from numpy.random import choice

# prepare the data, training and validation
df = pd.read_csv("solubility_dataset.csv")
smiles, sol_val = df['SMILES'], df['Solubility'].values
# draw 80% of the rows without replacement so that no row is sampled twice
train_id = choice(range(df.shape[0]), int(df.shape[0]*0.8), replace=False)
val_id = [i for i in range(df.shape[0]) if i not in train_id]
train_smiles, train_sol = smiles[train_id], sol_val[train_id]
val_smiles, val_sol = smiles[val_id], sol_val[val_id]

pred = regression(val_smiles, train_smiles, train_sol, k=10)

plt.scatter(val_sol, pred)
plt.show()

Projection using gzip

Perform a PCA projection using the distance matrix computed by molzip.

from molzip_simple import projection
import pandas as pd
import matplotlib.pyplot as plt

# prepare the data
df = pd.read_csv("solubility_dataset.csv")
smiles, sol_val = df['SMILES'], df['Solubility'].values

proj = projection(smiles)

plt.scatter(proj[:, 0], proj[:, 1], c=sol_val)
plt.show()
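
For readers who want to see what such a projection involves, here is a minimal sketch. It assumes that each row of the pairwise gzip distance matrix is treated as a feature vector and reduced to two principal components. The helpers `ncd_matrix` and `pca_2d` are hypothetical names, PCA is done with a plain NumPy SVD, and the actual `projection` function in `molzip_simple` may differ.

```python
import gzip

import numpy as np


def clen(text: str) -> int:
    """Length in bytes of the gzip-compressed string."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd_matrix(smiles):
    """Pairwise normalized compression distances between all SMILES strings."""
    n = len(smiles)
    lens = [clen(s) for s in smiles]
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            cxy = clen(smiles[i] + " " + smiles[j])
            dist[i, j] = (cxy - min(lens[i], lens[j])) / max(lens[i], lens[j])
    return dist


def pca_2d(features):
    """Project the rows of `features` onto their first two principal components."""
    centered = features - features.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :2] * s[:2]


# Toy input, for illustration only
smiles = ["CCO", "CCCO", "CCCCO", "c1ccccc1", "c1ccccc1O", "CC(=O)O"]
proj = pca_2d(ncd_matrix(smiles))
print(proj.shape)
```

The double loop compresses every pair once, so the cost grows quadratically with the number of molecules.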

MolZipRegressor usage

Create a molzip model object compatible with scikit-learn.

from molzip_simple import MolZipRegressor
import pandas as pd
import matplotlib.pyplot as plt
from numpy.random import choice

# prepare the data, training and validation
df = pd.read_csv("solubility_dataset.csv")
smiles, sol_val = df['SMILES'].values, df['Solubility'].values
# draw 80% of the rows without replacement so that no row is sampled twice
train_id = choice(range(df.shape[0]), int(df.shape[0]*0.8), replace=False)
val_id = [i for i in range(df.shape[0]) if i not in train_id]
train_smiles, train_sol = smiles[train_id], sol_val[train_id]
val_smiles, val_sol = smiles[val_id], sol_val[val_id]

# create & fit the model
mzr = MolZipRegressor(k=10)
mzr.fit(train_smiles, train_sol)

# make the prediction and visualize the results
pred = mzr.predict(val_smiles)
plt.scatter(val_sol, pred)
plt.show()
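
To illustrate what scikit-learn compatibility involves, the sketch below shows a small k-nearest-neighbour regressor over gzip distances that follows the estimator conventions (constructor parameters stored unchanged, `fit` returning `self`, learned attributes ending in an underscore). The class `GzipKNNRegressor` is a hypothetical stand-in written for this example, not the repository's `MolZipRegressor`. The unweighted mean over the `k` nearest neighbours is an assumption.

```python
import gzip

import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone


def _clen(text: str) -> int:
    """Length in bytes of the gzip-compressed string."""
    return len(gzip.compress(text.encode("utf-8")))


class GzipKNNRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical k-NN regressor on normalized compression distances."""

    def __init__(self, k=10):
        self.k = k  # stored unchanged so get_params() and clone() work

    def fit(self, X, y):
        # "Fitting" only memorizes the training strings and their targets.
        self.X_ = [str(s) for s in X]
        self.y_ = np.asarray(y, dtype=float)
        self.lens_ = [_clen(s) for s in self.X_]
        return self

    def predict(self, X):
        preds = []
        for query in X:
            query = str(query)
            cq = _clen(query)
            dists = np.array([
                (_clen(query + " " + ref) - min(cq, cr)) / max(cq, cr)
                for ref, cr in zip(self.X_, self.lens_)
            ])
            nearest = np.argsort(dists)[: self.k]
            # Assumption: unweighted mean of the k nearest targets
            preds.append(self.y_[nearest].mean())
        return np.array(preds)


# Toy data, for illustration only
train_x = ["CCO", "CCCO", "CCCCO", "c1ccccc1"]
train_y = [1.0, 2.0, 3.0, 4.0]

model = GzipKNNRegressor(k=4).fit(train_x, train_y)
pred = model.predict(["CCO"])
copy = clone(model)  # an unfitted copy with the same parameters
print(pred, copy.get_params())
```

Because the parameters are exposed through `get_params`, an estimator written this way can be cloned and tuned with the usual scikit-learn utilities.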
