To understand the dynamics of pathogen-specific host responses, we identified key sequence-, expression- and function-related features of the host's nucleic acid-sensing Toll-like receptor (TLR) proteins. Our findings suggest that these host-specific features are directly related to the strand specificity (single- or double-stranded) of the pathogen nucleic acid. We therefore developed a model to predict the pathogen nucleic acid strand specificity of TLRs, as follows.
The model is trained on 27 features of 129 TLR proteins from 16 species. The features include sequence network evolutionary features, gene expression features and functional annotation features. All sub-features and their estimation are defined in the Methods section of the publication (under review).
The estimated values of all features are stored in a tab-separated file, “training_data.txt”, located in the data folder. The model is trained on this file by running the Training_model.py Python script with the following command:
$ python3 Training_model.py data/training_data.txt doc
Here, doc is the name of the folder that contains all the performance measure files.
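The internals of Training_model.py are not reproduced here. As a rough illustration of what leave-one-out training of a random forest classifier can look like, assuming that is what the RFC-LOO name stands for, the sketch below uses scikit-learn on synthetic data. The sample count, the three-class labelling and all hyperparameters are placeholders, not the repository's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, matthews_corrcoef
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Synthetic stand-in for the feature table (NOT the TLR data):
# 60 samples, 27 features, 3 classes (the class count is an assumption).
X, y = make_classification(
    n_samples=60,
    n_features=27,
    n_informative=8,
    n_classes=3,
    random_state=0,
)

# Placeholder hyperparameters, chosen only to keep the sketch fast.
clf = RandomForestClassifier(n_estimators=25, random_state=0)

# Leave-one-out: each sample is predicted by a model trained on all the others.
y_pred = cross_val_predict(clf, X, y, cv=LeaveOneOut())

accuracy = accuracy_score(y, y_pred)
mcc = matthews_corrcoef(y, y_pred)
print(f"LOO accuracy: {accuracy:.3f}, MCC: {mcc:.3f}")
```

Leave-one-out validation is a common choice for small data sets, because every sample is used for testing exactly once while the model still sees almost all of the data during training.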
The trained model also summarizes the important features of TLRs that could be critical for nucleic acid strand specificity prediction (shown below).
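How the feature ranking is computed is not stated here. One common approach with random forests is the impurity-based importance exposed by scikit-learn as `feature_importances_`. The sketch below assumes that approach and runs on synthetic data with placeholder feature names, so its output says nothing about the actual TLR features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data and placeholder feature names (NOT the TLR features).
X, y = make_classification(
    n_samples=100, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances; scikit-learn normalizes them to sum to 1.
ranked = sorted(
    zip(feature_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}\t{score:.3f}")
```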
The best trained model, RFC-LOO_model.pkl, is generated by the Training_model.py Python script and saved in the bin folder for further use. The overall performance and the ROC curve indicate that the RFC-LOO model is well trained, as described below, and good enough to predict the pathogen nucleic acid specificity of TLRs.
The overall Accuracy of RFC-LOO model: 0.946
The overall Matthews correlation coefficient of RFC-LOO model: 0.903
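For readers unfamiliar with the two measures, the sketch below computes accuracy and the Matthews correlation coefficient (MCC) from confusion-matrix counts. It uses the binary form of MCC for simplicity; the model here may well be multi-class, in which case a generalized form applies. The counts are hypothetical and are not the model's results.

```python
from math import sqrt


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def binary_mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is taken as 0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts, chosen only for illustration.
tp, tn, fp, fn = 45, 40, 5, 10
print(f"accuracy = {accuracy(tp, tn, fp, fn):.3f}")
print(f"MCC      = {binary_mcc(tp, tn, fp, fn):.3f}")
```

MCC ranges from -1 to 1 and, unlike accuracy, takes all four confusion-matrix cells into account, which makes it more informative when the classes are imbalanced.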
This repository provides the source code (specificity_prediction.py) for predicting the strand specificity of pathogen nucleic acids sensed by uncharacterized TLRs. Our best trained model, RFC-LOO_model.pkl, serves as a predictor to identify the specificity of novel and blind set TLRs (listed in the “prediction_example.txt” file).
The predictor was developed for Python version 3 and above. The sklearn (scikit-learn), joblib, matplotlib and seaborn libraries must be installed to run the predictor.
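A quick way to confirm that the required libraries are importable before running the predictor is sketched below. It relies only on the Python standard library; the helper name `missing_modules` is introduced here for illustration and is not part of the repository.

```python
from importlib.util import find_spec

# Import names of the required libraries (scikit-learn is imported as "sklearn").
REQUIRED_MODULES = ["sklearn", "joblib", "matplotlib", "seaborn"]


def missing_modules(modules):
    """Return the names in `modules` that cannot be located for import."""
    return [name for name in modules if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All required libraries are available.")
```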
- Extract all the features and create a tab-separated file in the format shown in the “prediction_example.txt” file.
- Run the specificity_prediction.py Python script using the following command:
$ python3 Specificity_prediction.py data/<input_file>
Here, input_file is “prediction_example.txt”, which contains the aforementioned feature values for the novel and blind set TLRs.
- The predicted specificity for the uncharacterized TLRs is shown below and is also written to the “TLR_specificity_prediction.tsv” file.
Novel set TLRs specificity prediction:

TLR | Predicted specificity |
---|---|
sTLR18 | other |
sTLR25a | other |
sTLR25b | other |
nTLR25 | other |
sTLR27 | other |

Blind set TLRs specificity prediction:

TLR | Predicted specificity |
---|---|
aTLR9 | ss |
aTLR18 | other |
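The prediction steps above (load a saved model, read a tab-separated feature file, write the predictions) can be sketched as follows. This is not the code of the prediction script. It is a minimal, self-contained illustration that trains a toy stand-in model on made-up data, and the file names, the two-feature layout and the header row are all assumptions made for the example.

```python
import csv
import tempfile
from pathlib import Path

import joblib
from sklearn.ensemble import RandomForestClassifier

workdir = Path(tempfile.mkdtemp())

# Toy stand-in model so the sketch runs on its own; in the repository
# this role is played by the saved RFC-LOO_model.pkl.
toy_X = [[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8]]
toy_y = ["other", "other", "ss", "ss"]
model_path = workdir / "toy_model.pkl"
toy_model = RandomForestClassifier(n_estimators=10, random_state=0).fit(toy_X, toy_y)
joblib.dump(toy_model, model_path)

# Hypothetical input layout: header row, then a TLR name followed by feature values.
input_path = workdir / "toy_input.txt"
input_path.write_text("TLR\tf1\tf2\ntlrA\t0.1\t0.0\ntlrB\t0.95\t0.9\n")

with input_path.open(newline="") as handle:
    rows = list(csv.reader(handle, delimiter="\t"))[1:]  # skip the header row
names = [row[0] for row in rows]
features = [[float(value) for value in row[1:]] for row in rows]

# Load the persisted model and predict one label per input row.
model = joblib.load(model_path)
predictions = model.predict(features)

# Write name/prediction pairs as a tab-separated file.
output_path = workdir / "toy_prediction.tsv"
with output_path.open("w", newline="") as handle:
    csv.writer(handle, delimiter="\t").writerows(zip(names, predictions))

print(output_path.read_text())
```

A model saved with joblib should be loaded with library versions compatible with those used for training, since pickled scikit-learn estimators are not guaranteed to load across versions.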