Overview

This is an adaptation of the algorithm and code provided in https://github.com/daenuprobst/molzip.git.

The algorithm is also described in the paper: Jiang, Zhiying, et al. “Less is More: Parameter-Free Text Classification with Gzip.” arXiv preprint arXiv:2212.09410 (2022).

Two methods are provided: 1) regression, and 2) a PCA projection that uses only the SMILES strings.
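For intuition, the compression-based distance used in the paper cited above can be sketched with the standard library alone. This is a minimal illustration, not the code in molzip_simple.py: the helper names `compressed_len` and `ncd` are hypothetical, and joining the two strings with a space follows the paper's description rather than anything confirmed for this repository.

```python
import gzip


def compressed_len(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of `text`."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two strings.

    Strings that share structure compress better together, which
    lowers the numerator and therefore the distance.
    """
    c_a, c_b = compressed_len(a), compressed_len(b)
    c_ab = compressed_len(a + " " + b)  # assumption: space-joined concatenation
    return (c_ab - min(c_a, c_b)) / max(c_a, c_b)


# Example: distance between two SMILES strings (ethanol and propanol)
d = ncd("CCO", "CCCO")
```

The distance is not exactly symmetric, because gzip can compress `a + " " + b` and `b + " " + a` to slightly different lengths.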

Requirements

Tested with Python 3.9 using gzip, numpy and sklearn. pandas and datamol are also required.

Regression using gzip

Perform the regression using a simpler implementation of molzip.

from molzip_simple import regression
import pandas as pd
import matplotlib.pyplot as plt
from numpy.random import choice

# prepare the data, training and validation
df = pd.read_csv("solubility_dataset.csv")
smiles, sol_val = df['SMILES'], df['Solubility'].values
# sample 80% of the rows without replacement, so no row is drawn twice
train_id = choice(range(df.shape[0]), int(df.shape[0]*0.8), replace=False)
val_id = [i for i in range(df.shape[0]) if i not in train_id]
train_smiles, train_sol = smiles[train_id], sol_val[train_id]
val_smiles, val_sol = smiles[val_id], sol_val[val_id]

pred = regression(val_smiles, train_smiles, train_sol, k=10)

plt.scatter(val_sol, pred)
plt.show()
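
To make the idea behind the regression concrete, here is a small standard-library sketch of k-nearest-neighbour regression over gzip distances. It is an assumption-laden illustration, not the body of `regression` in molzip_simple.py: the function name `knn_regression`, the unweighted mean over the neighbours, and the space-joined concatenation are all choices made for this example.

```python
import gzip
from statistics import mean


def ncd(a, b):
    """Normalized compression distance between two strings (hypothetical helper)."""
    c_a = len(gzip.compress(a.encode("utf-8")))
    c_b = len(gzip.compress(b.encode("utf-8")))
    c_ab = len(gzip.compress((a + " " + b).encode("utf-8")))
    return (c_ab - min(c_a, c_b)) / max(c_a, c_b)


def knn_regression(query_smiles, train_smiles, train_values, k=10):
    """Predict each query value as the mean over its k nearest training strings."""
    train_smiles = list(train_smiles)
    train_values = list(train_values)
    predictions = []
    for query in query_smiles:
        distances = [ncd(query, ref) for ref in train_smiles]
        # indices of the k training strings with the smallest distance
        nearest = sorted(range(len(distances)), key=distances.__getitem__)[:k]
        predictions.append(mean([train_values[i] for i in nearest]))
    return predictions


# Toy usage with made-up values (not taken from solubility_dataset.csv)
toy_pred = knn_regression(["CCO"], ["CCO", "CCCO", "c1ccccc1"], [1.0, 2.0, 6.0], k=2)
```

Each query is compared with every training string, so the cost grows with the product of the two set sizes.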

Projection using gzip

Perform a PCA projection using the distance matrix computed by molzip.

from molzip_simple import projection
import pandas as pd
import matplotlib.pyplot as plt

# prepare the data
df = pd.read_csv("solubility_dataset.csv")
smiles, sol_val = df['SMILES'], df['Solubility'].values

proj = projection(smiles)

plt.scatter(proj[:, 0], proj[:, 1], c=sol_val)
plt.show()
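
The projection step can be pictured as follows: compute all pairwise gzip distances, treat each row of the resulting matrix as a feature vector, and keep the two leading principal components. The sketch below does this with numpy only. The names `distance_matrix` and `pca_project` are hypothetical, PCA is done through an SVD rather than sklearn, and `projection` in molzip_simple.py may differ in its details.

```python
import gzip

import numpy as np


def ncd(a, b):
    """Normalized compression distance between two strings (hypothetical helper)."""
    c_a = len(gzip.compress(a.encode("utf-8")))
    c_b = len(gzip.compress(b.encode("utf-8")))
    c_ab = len(gzip.compress((a + " " + b).encode("utf-8")))
    return (c_ab - min(c_a, c_b)) / max(c_a, c_b)


def distance_matrix(strings):
    """Square matrix of pairwise distances; row i describes string i."""
    strings = list(strings)
    n = len(strings)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            dist[i, j] = ncd(strings[i], strings[j])
    return dist


def pca_project(features, n_components=2):
    """Project the rows of `features` onto their leading principal components."""
    centered = features - features.mean(axis=0)
    # rows of vt are the principal axes, ordered by decreasing singular value
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


toy_proj = pca_project(distance_matrix(["CCO", "CCCO", "c1ccccc1", "CC(=O)O"]))
```

The distance matrix has one entry per pair of strings, so memory and run time scale quadratically with the number of molecules.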

MolZipRegressor usage

Create a molzip model object compatible with scikit-learn.

from molzip_simple import MolZipRegressor
import pandas as pd
import matplotlib.pyplot as plt
from numpy.random import choice

# prepare the data, training and validation
df = pd.read_csv("solubility_dataset.csv")
smiles, sol_val = df['SMILES'].values, df['Solubility'].values
# sample 80% of the rows without replacement, so no row is drawn twice
train_id = choice(range(df.shape[0]), int(df.shape[0]*0.8), replace=False)
val_id = [i for i in range(df.shape[0]) if i not in train_id]
train_smiles, train_sol = smiles[train_id], sol_val[train_id]
val_smiles, val_sol = smiles[val_id], sol_val[val_id]

# create & fit the model
mzr = MolZipRegressor(k=10)
mzr.fit(train_smiles, train_sol)

# make the prediction and visualize the results
pred = mzr.predict(val_smiles)
plt.scatter(val_sol, pred)
plt.show()
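
As a rough picture of what a scikit-learn style interface involves, the sketch below follows the usual estimator conventions: hyperparameters are stored untouched in `__init__`, `fit` returns `self`, and attributes learned from data end with an underscore. The class `GzipKNNRegressor` is hypothetical and is not the `MolZipRegressor` shipped in molzip_simple.py. It also caches the compressed lengths of the training strings in `fit`, which is a design choice made for this example.

```python
import gzip
from statistics import mean


def _clen(text):
    """Length in bytes of the gzip-compressed UTF-8 encoding of `text`."""
    return len(gzip.compress(text.encode("utf-8")))


class GzipKNNRegressor:
    """Hypothetical k-nearest-neighbour regressor with a fit/predict interface."""

    def __init__(self, k=10):
        self.k = k  # hyperparameter, stored unchanged

    def fit(self, X, y):
        self.X_ = list(X)
        self.y_ = list(y)
        # cache the compressed length of every training string once
        self.lengths_ = [_clen(s) for s in self.X_]
        return self

    def predict(self, X):
        predictions = []
        for query in X:
            c_q = _clen(query)
            dists = [
                (_clen(query + " " + ref) - min(c_q, c_r)) / max(c_q, c_r)
                for ref, c_r in zip(self.X_, self.lengths_)
            ]
            nearest = sorted(range(len(dists)), key=dists.__getitem__)[: self.k]
            predictions.append(mean([self.y_[i] for i in nearest]))
        return predictions


# Toy usage with made-up values (not taken from solubility_dataset.csv)
toy_model = GzipKNNRegressor(k=2).fit(["CCO", "CCCO", "c1ccccc1"], [1.0, 2.0, 6.0])
toy_out = toy_model.predict(["CCO"])
```

To work with scikit-learn utilities that clone estimators, such as `cross_val_score` or `GridSearchCV`, a class like this would typically also inherit from `sklearn.base.BaseEstimator` and `sklearn.base.RegressorMixin`.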
