K-MELLODDY

Overview

This repository contains a Python preprocessing pipeline for the K-MELLODDY standard data format, designed primarily for ADME/T prediction tasks. The pipeline supports SMILES standardization, outlier detection, feature scaling, and label creation for both classification and regression tasks. The repository will be updated whenever the K-MELLODDY standard data format changes.

Features

SMILES Standardization:
- Removes salts, isotopes, and stereochemistry (optional).
- Standardizes tautomeric forms and calculates molecular scaffolds.
Label Processing:
- Handles binary, categorical, and continuous labels.
- Converts continuous labels into classification labels using thresholds.
- Scales experimental values using StandardScaler.
Outlier Detection:
- Supports multiple methods for outlier detection, including IQR, Local Outlier Factor (LOF), One-Class SVM, and Gaussian Mixture Models (GMM).
Task Validation:
- Ensures the dataset aligns with the selected task (classification or regression).
Customizable Parameters:
- Options to retain stereochemistry, remove salts, detect outliers, and handle duplicates.
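
The scaling step listed under Label Processing can be illustrated without scikit-learn. The following is a minimal sketch, not the repository's code: it assumes the scaling is plain z-score standardization, which is what scikit-learn's `StandardScaler` computes by default (mean removed, then division by the population standard deviation). The function name `standardize` is hypothetical.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Scale values to zero mean and unit variance (population std, ddof=0)."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # Constant input: centring yields zeros, so skip the division.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Illustrative measurement values, not taken from any real dataset.
scaled = standardize([1.0, 2.0, 3.0, 4.0, 5.0])
print(scaled)
```

In a regression setting the mean and standard deviation would normally be computed on the training split only and reused for validation and test data.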

Installation

Clone the repository:

```
git clone https://github.com/HITS-AI/k-melloddy.git
cd k-melloddy
```

Install the required dependencies (a `requirements.txt` file will be added soon):
```
pip install -r requirements.txt
```

Usage

Example Code

```python
from preprocessor import Preprocessor

# Initialize the Preprocessor
preprocessor = Preprocessor(
    input_path='data/chemical_data.csv',
    task='classification',
    task_name='solubility',
    smiles_column='SMILES_Structure_Parent',
    activity_column='Measurement_Value',
    remove_salt=True,
    keep_stereo=False,
    keep_duplicates=False,
    detect_outliers=True,
    threshold=50
)

# Run Preprocessing
processed_data = preprocessor.preprocess()

# Save the processed data
processed_data.to_csv('data/processed_chemical_data.csv', index=False)
```

Input File Format

The input file should be a CSV file with at least two columns:

- SMILES column: contains the SMILES strings of the compounds (default: `SMILES_Structure_Parent`).
- Activity column: contains numeric or categorical activity values (default: `Measurement_Value`).
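
Before running the pipeline, it can help to confirm that a file actually has both columns. The sketch below uses only the standard library and the default column names given above; the helper `missing_columns` and the inline sample rows are illustrative assumptions, not part of the repository.

```python
import csv
import io

# Default column names as documented above.
REQUIRED_COLUMNS = ("SMILES_Structure_Parent", "Measurement_Value")


def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []  # fieldnames is None for an empty file
    return [name for name in required if name not in header]


# An in-memory stand-in for an input file; the rows are made up.
sample = io.StringIO(
    "SMILES_Structure_Parent,Measurement_Value\n"
    "CCO,12.5\n"
    "c1ccccc1,73.0\n"
)
missing = missing_columns(sample)
print(missing)
```

For a file on disk, pass an open file handle instead, for example `open(path, newline="")`.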

Parameters

| Parameter | Type | Description |
|---|---|---|
| `input_path` | `str` | Path to the input CSV file. |
| `task` | `str` | Task type: `classification` or `regression`. |
| `task_name` | `str` | Name of the task (e.g., `solubility`, `cyp1a2 inhibition`). |
| `smiles_column` | `str` | Name of the column containing SMILES strings. |
| `activity_column` | `str` | Name of the column containing activity values. |
| `remove_salt` | `bool` | Whether to remove salts from SMILES strings. |
| `keep_stereo` | `bool` | Whether to retain stereochemistry in SMILES strings. |
| `keep_duplicates` | `bool` | Whether to keep duplicate entries. |
| `detect_outliers` | `bool` | Whether to detect and remove outliers. |
| `threshold` | `float` | Threshold for converting continuous labels into classification labels (required for classification tasks). |
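
To make the role of `threshold` concrete, here is a minimal sketch of turning continuous measurements into binary labels. It is an assumption that values at or above the threshold form the positive class; the pipeline may use the opposite direction or a strict comparison, so check the source before relying on this. The function name `binarize` is hypothetical.

```python
def binarize(values, threshold):
    """Label a value 1 if it is at or above the threshold, else 0 (assumed rule)."""
    return [1 if v >= threshold else 0 for v in values]


# Illustrative measurements with the threshold from the example above.
labels = binarize([12.5, 50.0, 73.0], threshold=50)
print(labels)
```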

Methods

Key Methods

- `preprocess()`: Runs the complete preprocessing pipeline.
- `preprocess_compound(smiles)`: Processes a single SMILES string.
- `detect_outliers_statistical()`: Identifies outliers using the Interquartile Range (IQR) method.
- `detect_outliers_density_based()`: Identifies outliers using Local Outlier Factor (LOF).
- `detect_outliers_classification_based()`: Identifies outliers using One-Class SVM.
- `scale_experiment_values(labels)`: Scales numeric activity values.
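
As an illustration of the IQR method named above, the following standard-library sketch flags values that fall outside the usual fences, Q1 - 1.5 * IQR and Q3 + 1.5 * IQR. The multiplier of 1.5, the quartile interpolation method, and the function name `iqr_outlier_mask` are assumptions made for this example; the repository's `detect_outliers_statistical()` may differ in all three.

```python
from statistics import quantiles


def iqr_outlier_mask(values, k=1.5):
    """Return a list of booleans, True where a value lies outside the IQR fences."""
    # Quartiles with linear interpolation over the observed data range.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v < lower or v > upper for v in values]


# Illustrative values; the last one is far from the rest.
values = [10.0, 11.0, 12.0, 13.0, 14.0, 100.0]
mask = iqr_outlier_mask(values)
kept = [v for v, is_outlier in zip(values, mask) if not is_outlier]
print(kept)
```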

Dependencies

pandas
numpy
rdkit
scikit-learn
scipy

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgments

RDKit: For chemical informatics and machine learning tools.
scikit-learn: For machine learning algorithms and preprocessing utilities.

Contact

For questions or issues, please open an issue on the repository or contact the maintainer at [email protected].
