This is an adaptation of the algorithm and code provided in https://github.com/daenuprobst/molzip.git.
The algorithm is also described in the paper: Jiang, Zhiying, et al. “Less is More: Parameter-Free Text Classification with Gzip.” arXiv preprint arXiv:2212.09410 (2022).
Two methods are provided: 1) a regression and 2) a PCA projection that uses only the SMILES strings.
Tested with Python 3.9 using gzip, numpy, and sklearn. pandas and datamol are also required.
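As a rough illustration of the idea from the paper cited above, the sketch below computes a normalized compression distance (NCD) between two strings with gzip and uses it for a k-nearest-neighbour regression. The function names (`compressed_len`, `ncd`, `knn_regress`) are hypothetical and are not taken from molzip or molzip_simple; the actual implementation may differ in how it builds the input strings and how it weights the neighbours.

```python
import gzip


def compressed_len(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two strings."""
    ca, cb = compressed_len(a), compressed_len(b)
    # Assumption: the two strings are joined with a single space
    cab = compressed_len(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def knn_regress(query, train_smiles, train_values, k=3):
    """Predict a value for query as the unweighted mean of its k nearest neighbours."""
    scored = [(ncd(query, s), v) for s, v in zip(train_smiles, train_values)]
    scored.sort(key=lambda pair: pair[0])  # smallest distance first
    nearest = scored[:k]
    return sum(v for _, v in nearest) / len(nearest)
```

Because the distance relies only on compressed lengths, no training step is needed beyond storing the training strings and their target values.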
Perform the regression using a simpler implementation of molzip.
from molzip_simple import regression
import pandas as pd
import matplotlib.pyplot as plt
from numpy.random import choice
# prepare the data, training and validation
df = pd.read_csv("solubility_dataset.csv")
smiles, sol_val = df['SMILES'], df['Solubility'].values
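# sample 80% of the rows without replacement so no molecule appears twice in the training set
train_id = choice(df.shape[0], int(df.shape[0]*0.8), replace=False)
# the remaining rows form the validation set (a set makes the membership test fast)
train_set = set(train_id)
val_id = [i for i in range(df.shape[0]) if i not in train_set]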
train_smiles, train_sol = smiles[train_id], sol_val[train_id]
val_smiles, val_sol = smiles[val_id], sol_val[val_id]
pred = regression(val_smiles, train_smiles, train_sol, k=10)
plt.scatter(val_sol, pred)
plt.show()
Perform a PCA projection using the distance matrix computed by molzip.
from molzip_simple import projection
import pandas as pd
import matplotlib.pyplot as plt
# prepare the data
df = pd.read_csv("solubility_dataset.csv")
smiles, sol_val = df['SMILES'], df['Solubility'].values
proj = projection(smiles)
plt.scatter(proj[:, 0], proj[:, 1], c=sol_val)
plt.show()
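To make the projection step more concrete, here is a minimal sketch of one way to go from SMILES strings to 2D coordinates: build a pairwise gzip-based distance matrix, treat each row as a feature vector, and apply PCA. The helper names (`ncd`, `distance_matrix`, `pca_2d`) are hypothetical, and PCA is done here with a plain numpy SVD; whether `projection` in molzip_simple works exactly this way is an assumption.

```python
import gzip

import numpy as np


def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two strings using gzip."""
    ca = len(gzip.compress(a.encode("utf-8")))
    cb = len(gzip.compress(b.encode("utf-8")))
    cab = len(gzip.compress((a + " " + b).encode("utf-8")))
    return (cab - min(ca, cb)) / max(ca, cb)


def distance_matrix(smiles):
    """Pairwise NCD matrix of shape (n, n) for a list of SMILES strings."""
    n = len(smiles)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            dist[i, j] = ncd(smiles[i], smiles[j])
    return dist


def pca_2d(features):
    """Project the rows of features onto their first two principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


smiles_list = ["CCO", "CCN", "CCCC", "c1ccccc1", "c1ccccc1O"]
dist = distance_matrix(smiles_list)
coords = pca_2d(dist)  # one (x, y) pair per molecule
```

With a pandas Series as input, convert it first with `list(smiles)` so that positional indexing works. The double loop is quadratic in the number of molecules, so this sketch is only practical for small datasets.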
Create a molzip model object compatible with scikit-learn.
from molzip_simple import MolZipRegressor
import pandas as pd
import matplotlib.pyplot as plt
from numpy.random import choice
# prepare the data, training and validation
df = pd.read_csv("solubility_dataset.csv")
smiles, sol_val = df['SMILES'].values, df['Solubility'].values
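# sample 80% of the rows without replacement so no molecule appears twice in the training set
train_id = choice(df.shape[0], int(df.shape[0]*0.8), replace=False)
# the remaining rows form the validation set (a set makes the membership test fast)
train_set = set(train_id)
val_id = [i for i in range(df.shape[0]) if i not in train_set]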
train_smiles, train_sol = smiles[train_id], sol_val[train_id]
val_smiles, val_sol = smiles[val_id], sol_val[val_id]
# create & fit the model
mzr = MolZipRegressor(k=10)
mzr.fit(train_smiles, train_sol)
# make the prediction and visualize the results
pred = mzr.predict(val_smiles)
plt.scatter(val_sol, pred)
plt.show()