SPAR is a Python module for the analysis of swine pathogen genomes.
SPAR offers high-accuracy sequence annotation based on custom-built multiple sequence alignment (MSA) templates. Templates can represent virtually any kind of continuous genomic feature (e.g., coding region, gene, peptide) that can be translated into an amino acid sequence. Annotation uses regular expressions derived from the template MSAs to rapidly identify the location of genomic features. Where needed, the sequence is then aligned to a hidden Markov model (HMM) profile. The HMM profile alignment is used to accurately determine terminal endpoints, identify translational frameshifts, and locate indels.
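As a rough illustration of how a regular expression can be derived from an MSA, the sketch below collapses each column of a small protein alignment into a character class. The function name msa_to_regex and the toy alignment are hypothetical, and SPAR's own templates and regex construction may well differ.

```python
import re

def msa_to_regex(aligned_seqs):
    """Build a regex from equal-length aligned protein sequences.

    Each column becomes a character class of the residues observed there;
    a column that contains a gap ('-') is made optional.
    """
    if len({len(s) for s in aligned_seqs}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    parts = []
    for column in zip(*aligned_seqs):
        residues = sorted(set(column) - {"-"})
        if not residues:
            continue  # an all-gap column carries no information
        if len(residues) == 1:
            token = re.escape(residues[0])
        else:
            token = "[" + "".join(re.escape(r) for r in residues) + "]"
        if "-" in column:
            token += "?"  # residue may be absent in some sequences
        parts.append(token)
    return "".join(parts)

# Hypothetical three-sequence alignment (illustration only)
toy_msa = ["MKT-LV", "MRTALV", "MKTALI"]
pattern = msa_to_regex(toy_msa)

# Locate the feature in a translated query sequence
match = re.search(pattern, "GGMKTLVPP")
```

A pattern built this way only matches combinations of residues already seen in the template, which is one reason a follow-up HMM profile alignment is useful for divergent queries.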
Comparable annotation pipelines use BLAST to identify highly similar parent sequences. Following pairwise alignment, annotations from the parent sequence(s) are transferred to the query. By comparison, these methods are generally faster and sufficiently accurate when a large reference database is available. In the absence of similar comparison sequences, HMM profile alignment is a good alternative to pairwise alignment because it pools the available information to maximize alignment accuracy.
SPAR also provides a classification method unique to Porcine reproductive and respiratory syndrome virus-2 (PRRSV-2) that assigns restriction fragment length polymorphism (RFLP) patterns. RFLP assignment is flexible; HMM profile alignment is used to compensate for indels in the query sequence that would impede pattern determination. No other comparable public resource is known.
OS X & Linux:
Install Python 3.7
Unzip required.zip
Install necessary Python packages:
pip install -r requirements.txt
Install MAFFT (version >= 7.310) dependency
Install BLAST (version >= 2.10.0+) dependency
Install HMMER3 (release >= 3.1b2) dependency
Rename include/settings_template.py to include/settings.py and make the following changes:
- Open a bash command line terminal
- Type echo $PATH into the terminal and copy the output
- Replace "None" in the line bash_path = None with the copied string (quoted)
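If copying the terminal output by hand proves error-prone, the quoted line can also be generated from Python. This is only a convenience sketch: the helper name bash_path_line is hypothetical, and it assumes the PATH seen by the Python process matches the one in your bash terminal.

```python
import os

def bash_path_line(path_value):
    """Return the settings line with the PATH value quoted."""
    # repr() adds the quotes and escapes any unusual characters
    return f"bash_path = {path_value!r}"

# Print the line to paste into include/settings.py
print(bash_path_line(os.environ.get("PATH", "")))
```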
Usage information is available via command line:
$ python3 run.py -h
usage: run.py [-h] [-v] {annotate,rflp} ...
positional arguments:
{annotate,rflp} Type of input (rflp or annotate). See specific help for more options based on input type chosen e.g. python run.py rflp --help
annotate Annotate genomic features. Currently supports CSFV, FMDV, PDCoV, PEDV, PRRSV-1, PRRSV-2, and SVA
rflp Assign RFLP value to PRRSV-2.
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
Detailed information for each command can be requested:
Genome annotation
$ python3 run.py annotate -h
usage: run.py annotate [-h] [--limit LIMIT] [--organism ORGANISM] [--output OUTPUT] input
positional arguments:
input Input FASTA file name
optional arguments:
-h, --help show this help message and exit
--limit LIMIT Annotates only genomic features provided in comma delimited list
--organism ORGANISM Possible organism identity of sequences in input FASTA file provided in comma delimited list
--output OUTPUT Output gff3 file name
The --organism argument specifies the possible species identity of any sequence within the input FASTA file. Identification is based solely on BLAST similarity score. Additional pathogens may be added by creating a directory path under the requirements directory with the following structure: organism_abbreviation/hmm_profiles/msa_save. Add nucleotide MSA templates in FASTA file format to the msa_save directory. Templates can represent virtually any kind of continuous genomic feature that can be translated into an amino acid sequence. If a genomic feature contains a translational frameshift, adjust the translational_frameshift_di variable as described in include/settings_template.py. Once the required templates have been provided, running the script for the first time will initiate a build process.
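The directory layout described above could be created with a few lines of Python. This is a minimal sketch: the helper name make_template_dir and the abbreviation "XYZ" are hypothetical, and a temporary directory stands in for the real requirements directory, whose location depends on your installation.

```python
import tempfile
from pathlib import Path

def make_template_dir(requirements_dir, organism_abbreviation):
    """Create <requirements_dir>/<abbreviation>/hmm_profiles/msa_save and return it."""
    msa_dir = Path(requirements_dir) / organism_abbreviation / "hmm_profiles" / "msa_save"
    msa_dir.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
    return msa_dir

# Demonstration in a throwaway directory; point this at the real
# requirements directory when adding a pathogen.
with tempfile.TemporaryDirectory() as tmp:
    created = make_template_dir(tmp, "XYZ")
    relative_parts = created.relative_to(tmp).parts
    existed = created.is_dir()
```

The nucleotide MSA templates in FASTA format would then be copied into the returned msa_save directory before the first run.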
PRRSV-2 RFLP classification
$ python3 run.py rflp -h
usage: run.py rflp [-h] [--full FULL] [--output OUTPUT] input
positional arguments:
input Input FASTA file name
optional arguments:
-h, --help show this help message and exit
--full FULL Only assign RFLP pattern to input sequences that are complete
--output OUTPUT Output FASTA file name
The output file consists of the input FASTA file contents with the RFLP pattern value appended to the end of each header, separated by a forward slash ("/") delimiter. If a sequence is not identified as PRRSV-2, "na" is appended. If a sequence does not contain a complete ORF5 gene, "null" is appended. By default, RFLP determination is only performed on complete genes. Use caution when enabling RFLP assignment for partial ORF5 sequences, because restriction sites cannot be inferred from missing residues.
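For downstream processing, the appended value can be recovered by splitting each header on its last forward slash. The sketch below assumes only the header convention described above; the function names, sample headers, and the example pattern value are made up for illustration.

```python
def split_rflp(header):
    """Split a FASTA header into (name, rflp_value) at the last '/'."""
    text = header.lstrip(">")
    name, sep, value = text.rpartition("/")
    if not sep:
        return text, None  # no appended field found
    return name, value

def read_rflp(fasta_lines):
    """Map each header (without its suffix) to the appended value."""
    return dict(
        split_rflp(line.strip())
        for line in fasta_lines
        if line.startswith(">")
    )

# Hypothetical output file contents (illustration only)
example = [
    ">sampleA/1-7-4", "ATGTTG",
    ">sampleB/na", "ATGTTG",
    ">USA/sampleC/null", "ATGTTG",
]
results = read_rflp(example)
```

Splitting on the last slash rather than the first keeps headers that already contain slashes intact. Values of "na" and "null" should be treated as missing rather than as patterns.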
Please contact the developer with any comments, concerns, or questions: [email protected]
Alternatively, post an issue in this GitHub repository for assistance.