TLRs - pathogen nucleic acid specificity prediction

To understand the dynamics of pathogen-specific host responses, we identified key sequence-, expression- and function-related features of the host's nucleic acid-sensing Toll-like receptor (TLR) proteins. Our findings suggest that these host-specific features are directly related to the strand specificity (single- or double-stranded) of the pathogen nucleic acid. We therefore developed the following model to predict the pathogen nucleic acid strand specificity of TLRs.

RFC-LOO (Random Forest Classifier - Leave One Out) model

Training of model

The model is trained on 27 features of 129 TLR proteins from 16 species. These include sequence network evolutionary features, gene expression features and functional annotation features. All sub-features and how they are estimated are defined in the Methods section of the publication (under review).

The estimated values of all features are stored in a tab-separated file, “training_data.txt”, located in the data folder. The Training_model.py Python script trains the model on this file with the following command:

$ python3 Training_model.py data/training_data.txt doc

Here, doc is the name of the folder that holds all the performance measure files.

The trained model also summarizes the important TLR features that could be critical for predicting nucleic acid strand specificity (shown below).
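As an illustration of the training idea described above, here is a minimal sketch of a random forest evaluated with leave-one-out cross-validation in scikit-learn. It is not the code in Training_model.py: the synthetic data, the feature count, the hyperparameters and the two labels ("ss" and "other") are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Synthetic stand-in for training_data.txt (hypothetical): 40 samples, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, "ss", "other")

# Hyperparameters are illustrative, not those of the published model.
clf = RandomForestClassifier(n_estimators=50, random_state=0)

# Leave-one-out: each sample is predicted by a forest fitted on all other samples.
loo_pred = cross_val_predict(clf, X, y, cv=LeaveOneOut())
loo_accuracy = float(np.mean(loo_pred == y))

# Refit on the full data set to obtain impurity-based feature importances.
clf.fit(X, y)
ranking = np.argsort(clf.feature_importances_)[::-1]  # most important first

print(f"LOO accuracy: {loo_accuracy:.3f}")
print("Feature indices by importance:", ranking.tolist())
```

Leave-one-out is a natural choice for a small data set because every protein is used for testing exactly once while the rest are used for fitting.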

Best model and its performance

The best trained model, RFC-LOO_model.pkl, is generated by the Training_model.py Python script and saved in the bin folder for further use. The overall performance measures reported below and the ROC curve indicate that the RFC-LOO model is well trained and good enough to predict the pathogen nucleic acid specificity of TLRs.

The overall Accuracy of RFC-LOO model: 0.946
The overall Matthews correlation coefficient of RFC-LOO model: 0.903
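For readers unfamiliar with the two measures, the sketch below computes accuracy and the Matthews correlation coefficient (MCC) from true and predicted labels in the two-class case. The labels and predictions are invented for illustration and are unrelated to the values reported above; how Training_model.py computes its measures is not shown here. For more than two classes, scikit-learn's `matthews_corrcoef` provides the generalized form.

```python
import math

def binary_metrics(y_true, y_pred, positive):
    """Return (accuracy, MCC) for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc

# Invented example labels (not repository data).
y_true = ["ss", "ss", "ss", "other", "other", "other", "other", "other"]
y_pred = ["ss", "ss", "other", "other", "other", "other", "other", "ss"]

accuracy, mcc = binary_metrics(y_true, y_pred, positive="ss")
print(f"accuracy={accuracy:.3f}  mcc={mcc:.3f}")  # accuracy=0.750  mcc=0.467
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the classes are imbalanced.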

Predicting TLR specificity

This repository provides the source code (Specificity_prediction.py) for predicting the strand specificity of pathogen nucleic acids sensed by uncharacterized TLRs. Our best trained model, RFC-LOO_model.pkl, serves as the predictor for the specificity of the novel and blind set TLRs (listed in the “prediction_example.txt” file).

Conditions

The predictor was developed for Python 3 or later. The sklearn (scikit-learn), joblib, matplotlib and seaborn libraries must be installed to run the predictor.
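One way to confirm that the environment meets these conditions before running the predictor is a short check like the one below. It is a convenience sketch and not part of the repository; the helper name `missing_modules` is hypothetical, and the module names are the import names of the four libraries.

```python
import importlib.util
import sys

# Import names of the libraries required by the predictor.
REQUIRED = ["sklearn", "joblib", "matplotlib", "seaborn"]

def missing_modules(names):
    """Return the top-level modules from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if sys.version_info < (3,):
    print("Python 3 or later is required.")

missing = missing_modules(REQUIRED)
if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All required libraries are installed.")
```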

Steps

Extract all the features and create a tab-separated file in the format of the “prediction_example.txt” file.
Run the Specificity_prediction.py Python script with the following command:

$ python3 Specificity_prediction.py data/<input_file>

Here, input_file is “prediction_example.txt”, which contains the aforementioned feature values for the novel and blind set TLRs.

The predicted specificity of the uncharacterized TLRs is shown below and is also written to the “TLR_specificity_prediction.tsv” file.

Novel set TLRs specificity prediction:
sTLR18	other
sTLR25a	other
sTLR25b	other
nTLR25	other
sTLR27	other
Blind set TLRs specificity prediction:
aTLR9	ss
aTLR18	other
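The prediction workflow above (load a saved model, read a tab-separated feature file, write one label per TLR) can be sketched as follows. This is not the code in Specificity_prediction.py. To stay self-contained it trains a toy model and writes a toy input file in a temporary folder; the file layout (header row, TLR name in the first column, numeric features after it), the feature names, the TLR names and the helper `predict_file` are all assumptions made for the example.

```python
import csv
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier

workdir = tempfile.mkdtemp()
model_path = os.path.join(workdir, "toy_model.pkl")
input_path = os.path.join(workdir, "toy_input.txt")
output_path = os.path.join(workdir, "toy_prediction.tsv")

# Toy stand-in for the trained model (NOT RFC-LOO_model.pkl): two made-up features.
toy_model = RandomForestClassifier(n_estimators=20, random_state=0)
toy_model.fit([[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.8, 1.0]],
              ["other", "other", "ss", "ss"])
joblib.dump(toy_model, model_path)

# Hypothetical input layout: header row, then name followed by feature values.
with open(input_path, "w", newline="") as fh:
    writer = csv.writer(fh, delimiter="\t")
    writer.writerow(["name", "feature_1", "feature_2"])
    writer.writerow(["toyTLR_a", 0.1, 0.0])
    writer.writerow(["toyTLR_b", 0.9, 1.0])

def predict_file(model_file, tsv_in, tsv_out):
    """Load a pickled model, predict a label per row, write name<TAB>label."""
    model = joblib.load(model_file)
    with open(tsv_in, newline="") as fh:
        rows = list(csv.reader(fh, delimiter="\t"))[1:]  # skip the header row
    names = [row[0] for row in rows]
    features = [[float(value) for value in row[1:]] for row in rows]
    labels = [str(label) for label in model.predict(features)]
    with open(tsv_out, "w", newline="") as fh:
        csv.writer(fh, delimiter="\t").writerows(zip(names, labels))
    return dict(zip(names, labels))

predictions = predict_file(model_path, input_path, output_path)
for name, label in predictions.items():
    print(f"{name}\t{label}")
```

The feature columns in the input file must appear in the same order as in the data the model was trained on, because the model identifies features by position.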
