From de059f09a005f68b98f077b9a22947e9e5425153 Mon Sep 17 00:00:00 2001
From: Roman Joeres <70888826+Old-Shatterhand@users.noreply.github.com>
Date: Wed, 15 Nov 2023 14:34:28 +0100
Subject: [PATCH] Update README.md

---
 README.md | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/README.md b/README.md
index eb2f0b9..8085a2b 100644
--- a/README.md
+++ b/README.md
@@ -10,16 +10,16 @@
 [![downloads](https://anaconda.org/kalininalab/datasail/badges/downloads.svg)](https://anaconda.org/kalininalab/datasail)
 ![Python 3](https://img.shields.io/badge/python-3-blue.svg)
 
-DataSAIL is a tool that splits data while minimizing the information leakage. This tool formulates the splitting of a
-dataset as constrained minimization problem and computes the assignment of data points to splits while minimizing the
+DataSAIL is a tool that splits data while minimizing Information Leakage. This tool formulates the splitting of a
+dataset as a constrained minimization problem and computes the assignment of data points to splits while minimizing the
 objective function that accounts for information leakage.
 
 Internally, DataSAIL uses disciplined quasi-convex programming and binary quadratic programs to formulate the
 optimization task. DataSAIL utilizes solves like [SCIP](https://scipopt.org/), one of the fastest non-commercial
-solvers for this type of problems, and [MOSEK](https://mosek.com), a commercial solver that distributes free licenses
-for academic use. There are other options, please check the documentation.
+solvers for this type of problem, and [MOSEK](https://mosek.com), a commercial solver that distributes free licenses
+for academic use. There are other options; please check the documentation.
 
-Apart from the here presented short overview, you can find a more detailed explanation of the tool on
+Apart from the short overview, you can find a more detailed explanation of the tool on
 [ReadTheDocs](https://datasail.readthedocs.io/en/latest/index.html).
 
 ## Installation
@@ -37,7 +37,7 @@ pip install grakel
 to install it into a new empty environment or
 
 ````shell
-conda install -c conda-forge -c kalininalab -c bioconda -c mosek DataSAIL
+mamba install -c conda-forge -c kalininalab -c bioconda -c mosek DataSAIL
 pip install grakel
 ````
 
@@ -48,23 +48,23 @@ DataSAIL is available from Python 3.8 and newer.
 
 ## Usage
 
-DataSAIL is installed as a commandline tool. So, in the conda environment DataSAIL has been installed to, you can run
+DataSAIL is installed as a command-line tool. So, in the conda environment, DataSAIL has been installed to, you can run
 
 ````shell
 datasail --e-type P --e-data --e-sim mmseqs --output --technique C1e
 ````
 
-to split a set of proteins that have been clustered using mmseqs. For a full list of arguments run `datasail -h` and
+to split a set of proteins that have been clustered using mmseqs. For a full list of arguments, run `datasail -h` and
 checkout [ReadTheDocs](https://datasail.readthedocs.io/en/latest/index.html).
 
 ## When to use DataSAIL and when not to use
 
-One can distinguish two main ways to train a machine learning model on biological data.
-* Either the model shall be applied to data that is substantially different from the data to train on. In this case it
-  is important to have test cases that model this real world application scenario properly by being as dissimilar as
+One can distinguish two main ways to train a machine-learning model on biological data.
+* Either the model shall be applied to data substantially different from the data to train on. In this case, it
+  is essential to have test cases that correctly model this real-world application scenario by being as dissimilar as
   possible to the training data.
-* Or the training dataset already covers the full space of possible samples shown to the model.
+* Or the training dataset already covers the whole space of possible samples shown to the model.
 
-DataSAIL is created to compute complex splits of the data by separating data based on similarities. This creates
-complex data-splits for the first scenario. Therefore, use DataSAIL when your model is applied to data that is
-different from your training data but not if the data in application is more or less the same as in the training.
+DataSAIL is created to compute complex data splits by separating data based on similarities. This makes
+complex data splits for the first scenario. So, you can use DataSAIL when your model is applied to data
+different from your training data but not if the data in the application is more or less the same as in the training.