GitHub - akshayayadav/undercl-detection-correction: Container for under-clustering detection and correction of gene families

Tool for detecting and correcting under-clustered gene families

This container packages a tool for detecting and correcting under-clustered gene families using a sequence-pair-classification-based method, along with all the prerequisite packages. Briefly, the method first trains and tests HMMs, based on sequence pairs, from the sequences of a given family and the candidate missing sequences (a.k.a. the closest non-family sequences) for that family, to determine how well the family sequences separate from the non-family sequences. The resulting classification statistics are then used to predict missing sequences for the family from the list of candidate missing sequences.

Steps for executing the analysis

This application can be executed in two modes: global and subset. The global mode predicts missing sequences for a given family by searching all of the other families. The subset mode predicts missing sequences from a non-overlapping subset of families (usually small families that could be merged into larger families).

Both modes can be run with either of two prediction modes: F1-score and F2-score. The F1-score mode is the conservative one: it places equal weight on precision and recall when predicting missing sequences. The F2-score mode favors recall over precision and can therefore predict more missing sequences than the F1-score mode for the same family. If the user is confident that the given families are highly accurate, with a low probability of containing "wrong" or unrelated sequences, the F2-score mode can be used for correcting the families. If the user is not confident about the accuracy of the families, the F1-score mode is the safer choice, because it avoids attracting more unrelated sequences to the families.
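For reference, the standard definition of the F-beta score shows why F2 favors recall. Here P is precision and R is recall. This is the textbook formula only; how the tool computes or thresholds these scores internally is not stated here and is not assumed.

```latex
F_\beta = \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R},
\qquad
F_1 = \frac{2PR}{P+R},
\qquad
F_2 = \frac{5PR}{4P+R}
```

As an illustration with made-up numbers, a prediction with P = 0.9, R = 0.6 and one with P = 0.6, R = 0.9 both have F1 = 0.72, but their F2 scores are about 0.64 and 0.82 respectively, so an F2-based criterion prefers the higher-recall prediction.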

Downloading the container

docker pull akshayayadav/undercl-detection-correction

Preparing the data

Prepare a data directory <my_directory>, with a user-defined name, containing a directory named family_fasta and a FASTA file in which to search for missing sequences for the families. Place the FASTA files for all the families to be analyzed in the family_fasta directory. For global mode, the FASTA file must be named proteomes.fa and MUST contain sequences from all the families, including the families present in the family_fasta directory. For subset mode, the FASTA file must be named unclustered.fa and MUST NOT contain any sequences from the families present in the family_fasta directory.
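The layout described above can be sketched as a short shell script. The directory name my_directory, the source paths in the commented cp lines, and the check_layout helper are illustrative assumptions, not part of the container.

```shell
#!/bin/sh
# Hypothetical data directory name; substitute your own.
DATA_DIR="${DATA_DIR:-my_directory}"

# The container expects a family_fasta directory inside the data directory.
mkdir -p "$DATA_DIR/family_fasta"

# Copy your files into place (source paths below are placeholders):
#   cp /path/to/families/*.fa     "$DATA_DIR/family_fasta/"
#   cp /path/to/all_sequences.fa  "$DATA_DIR/proteomes.fa"     # global mode
#   cp /path/to/other_seqs.fa     "$DATA_DIR/unclustered.fa"   # subset mode

# Hypothetical helper: report which modes the directory looks ready for.
# It only checks file names, not the sequence content requirements.
check_layout() {
  dir=$1
  if [ ! -d "$dir/family_fasta" ]; then
    echo "missing directory: $dir/family_fasta" >&2
    return 1
  fi
  if [ -f "$dir/proteomes.fa" ]; then
    echo "global mode: proteomes.fa found"
  fi
  if [ -f "$dir/unclustered.fa" ]; then
    echo "subset mode: unclustered.fa found"
  fi
  return 0
}

check_layout "$DATA_DIR"
```

The helper cannot verify that proteomes.fa contains every family's sequences or that unclustered.fa contains none of them; those requirements still have to be checked against your own data.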

Running the analysis

Global mode with F1-score function and <n> cores

docker run -v <absolute_path_to_data_directory>:/data akshayayadav/undercl-detection-correction run_analysis_global-F1.sh -c <n>

Global mode with F2-score function and <n> cores

docker run -v <absolute_path_to_data_directory>:/data akshayayadav/undercl-detection-correction run_analysis_global-F2.sh -c <n>

Subset mode with F1-score function and <n> cores

docker run -v <absolute_path_to_data_directory>:/data akshayayadav/undercl-detection-correction run_analysis_subset-F1.sh -c <n>

Subset mode with F2-score function and <n> cores

docker run -v <absolute_path_to_data_directory>:/data akshayayadav/undercl-detection-correction run_analysis_subset-F2.sh -c <n>
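The four commands above differ only in the script name, so they can be generated from the mode and the score function. The following is a minimal sketch of a wrapper, assuming the script naming pattern shown above; the build_cmd function and its argument order are hypothetical and not shipped with the container. It prints the command rather than running it, so the result can be inspected first.

```shell
#!/bin/sh
# Hypothetical helper: print the docker command for a given
# data directory, mode (global|subset), score (F1|F2) and core count.
build_cmd() {
  data_dir=$1
  mode=$2
  score=$3
  cores=$4

  case "$mode" in
    global|subset) ;;
    *) echo "mode must be 'global' or 'subset'" >&2; return 1 ;;
  esac
  case "$score" in
    F1|F2) ;;
    *) echo "score must be 'F1' or 'F2'" >&2; return 1 ;;
  esac

  # The data directory is expected to be an absolute path, as in the commands above.
  echo "docker run -v ${data_dir}:/data akshayayadav/undercl-detection-correction run_analysis_${mode}-${score}.sh -c ${cores}"
}

# Example: global mode, F1-score function, 4 cores, data in ./my_directory
build_cmd "$PWD/my_directory" global F1 4
```

To execute the printed command instead of only displaying it, the output can be passed to the shell, for example with eval "$(build_cmd "$PWD/my_directory" subset F2 8)".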
