Update README.md
Old-Shatterhand authored Nov 15, 2023
1 parent fea2e85 commit de059f0
Showing 1 changed file with 15 additions and 15 deletions.
30 changes: 15 additions & 15 deletions README.md
@@ -10,16 +10,16 @@
[![downloads](https://anaconda.org/kalininalab/datasail/badges/downloads.svg)](https://anaconda.org/kalininalab/datasail)
![Python 3](https://img.shields.io/badge/python-3-blue.svg)

-DataSAIL is a tool that splits data while minimizing the information leakage. This tool formulates the splitting of a
-dataset as constrained minimization problem and computes the assignment of data points to splits while minimizing the
+DataSAIL is a tool that splits data while minimizing Information Leakage. This tool formulates the splitting of a
+dataset as a constrained minimization problem and computes the assignment of data points to splits while minimizing the
objective function that accounts for information leakage.

Internally, DataSAIL uses disciplined quasi-convex programming and binary quadratic programs to formulate the
optimization task. DataSAIL utilizes solves like [SCIP](https://scipopt.org/), one of the fastest non-commercial
-solvers for this type of problems, and [MOSEK](https://mosek.com), a commercial solver that distributes free licenses
-for academic use. There are other options, please check the documentation.
+solvers for this type of problem, and [MOSEK](https://mosek.com), a commercial solver that distributes free licenses
+for academic use. There are other options; please check the documentation.
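
As a rough illustration of what "splitting as constrained minimization" can mean, the sketch below brute-forces a tiny two-way split. It is not DataSAIL's formulation or API: the objective (total similarity between points placed in different splits), the fixed test-set size constraint, and the names `leakage`, `best_split`, and `SIM` are all assumptions made for this example. DataSAIL itself hands the problem to a solver rather than enumerating assignments.

```python
from itertools import product
from typing import List, Sequence, Tuple


def leakage(assignment: Sequence[int], sim: Sequence[Sequence[float]]) -> float:
    """Sum the similarity of every pair of points that ends up in different splits."""
    n = len(assignment)
    return sum(
        sim[i][j]
        for i in range(n)
        for j in range(i + 1, n)
        if assignment[i] != assignment[j]
    )


def best_split(sim: Sequence[Sequence[float]], test_size: int) -> Tuple[List[int], float]:
    """Enumerate all train(0)/test(1) assignments with exactly `test_size` test points
    and return the one with the lowest cross-split similarity (toy objective)."""
    n = len(sim)
    best_assignment, best_cost = None, float("inf")
    for candidate in product((0, 1), repeat=n):
        if sum(candidate) != test_size:  # constraint: fixed test-set size
            continue
        cost = leakage(candidate, sim)
        if cost < best_cost:
            best_assignment, best_cost = list(candidate), cost
    if best_assignment is None:
        raise ValueError("no assignment satisfies the size constraint")
    return best_assignment, best_cost


# Hypothetical pairwise similarities: points 0/1 and points 2/3 form two similar pairs.
SIM = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]

assignment, cost = best_split(SIM, test_size=2)
print(assignment, cost)
```

Enumeration grows as 2^n, which is why a real tool needs a dedicated solver; the sketch only shows how keeping similar points in the same split lowers the objective.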

-Apart from the here presented short overview, you can find a more detailed explanation of the tool on
+Apart from the short overview, you can find a more detailed explanation of the tool on
[ReadTheDocs](https://datasail.readthedocs.io/en/latest/index.html).

## Installation
@@ -37,7 +37,7 @@ pip install grakel
to install it into a new empty environment or

````shell
-conda install -c conda-forge -c kalininalab -c bioconda -c mosek DataSAIL
+mamba install -c conda-forge -c kalininalab -c bioconda -c mosek DataSAIL
pip install grakel
````

@@ -48,23 +48,23 @@ DataSAIL is available from Python 3.8 and newer.

## Usage

-DataSAIL is installed as a commandline tool. So, in the conda environment DataSAIL has been installed to, you can run
+DataSAIL is installed as a command-line tool. So, in the conda environment, DataSAIL has been installed to, you can run

````shell
datasail --e-type P --e-data <path_to_fasta> --e-sim mmseqs --output <path_to_output_path> --technique C1e
````

-to split a set of proteins that have been clustered using mmseqs. For a full list of arguments run `datasail -h` and
+to split a set of proteins that have been clustered using mmseqs. For a full list of arguments, run `datasail -h` and
checkout [ReadTheDocs](https://datasail.readthedocs.io/en/latest/index.html).
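
For scripted use, the same command-line call can be assembled and launched from Python. This is a minimal sketch that only wraps the CLI invocation shown above; the helper names `build_datasail_command` and `split_proteins` are hypothetical, and it does not use or assume any Python API of DataSAIL. The flags are copied from the example command, nothing more.

```python
import shlex
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_datasail_command(fasta: Path, output_dir: Path) -> List[str]:
    """Assemble the argument list for the protein-splitting call from the README."""
    return [
        "datasail",
        "--e-type", "P",             # entity type: proteins
        "--e-data", str(fasta),      # input FASTA file
        "--e-sim", "mmseqs",         # similarity/clustering via mmseqs
        "--output", str(output_dir),
        "--technique", "C1e",
    ]


def split_proteins(fasta: Path, output_dir: Path) -> None:
    """Run datasail if it is on PATH; otherwise print the command that would run."""
    command = build_datasail_command(fasta, output_dir)
    if shutil.which("datasail") is None:
        print("datasail not found on PATH; would run:", shlex.join(command))
        return
    # check=True raises CalledProcessError if datasail exits with a non-zero status.
    subprocess.run(command, check=True)
```

Calling `split_proteins(Path("proteins.fasta"), Path("splits"))` (both paths are placeholders) runs the split inside the environment where DataSAIL is installed. Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.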

## When to use DataSAIL and when not to use

-One can distinguish two main ways to train a machine learning model on biological data.
-* Either the model shall be applied to data that is substantially different from the data to train on. In this case it
-is important to have test cases that model this real world application scenario properly by being as dissimilar as
+One can distinguish two main ways to train a machine-learning model on biological data.
+* Either the model shall be applied to data substantially different from the data to train on. In this case, it
+is essential to have test cases that correctly model this real-world application scenario by being as dissimilar as
possible to the training data.
-* Or the training dataset already covers the full space of possible samples shown to the model.
+* Or the training dataset already covers the whole space of possible samples shown to the model.

-DataSAIL is created to compute complex splits of the data by separating data based on similarities. This creates
-complex data-splits for the first scenario. Therefore, use DataSAIL when your model is applied to data that is
-different from your training data but not if the data in application is more or less the same as in the training.
+DataSAIL is created to compute complex data splits by separating data based on similarities. This makes
+complex data splits for the first scenario. So, you can use DataSAIL when your model is applied to data
+different from your training data but not if the data in the application is more or less the same as in the training.
