A compact pipeline for searching selected domains (based on Pfam HMM models) in proteins encoded in genomic sequences.
To set up the environment properly, you need Miniconda installed on the linux-64 platform. The pipeline was tested on Ubuntu 22.04 using conda 23.7.4. Once you are ready with that, create the `pfam-gen` conda environment described in `envs/pfam-gen.txt`:

```
conda create --name pfam-gen --file envs/pfam-gen.txt
```
Next, activate the environment:

```
conda activate pfam-gen
```
Using the `test_data.sh` script, you may download 20 staphylococcal genomic sequences from NCBI GenBank to the `genomes` directory, as well as the `Pfam-A.hmm` file, to test the pipeline on a set of 10 CAT (catalytic) and 5 CWT (cell-wall-targeting) domains (`input/domains.tsv`) that occur in bacterial peptidoglycan hydrolases. Simply run the script in the pipeline directory:

```
./test_data.sh
```
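For orientation, the script performs downloads roughly equivalent to the sketch below. This is an illustrative assumption, not the script's actual contents: the accession list, the `efetch` dependency (NCBI Entrez Direct) and the exact URLs may differ from what `test_data.sh` really uses.

```
# Hypothetical sketch of what test_data.sh does; see the shipped script
# for the real accessions and download commands.
mkdir -p genomes

# Download genomic FASTA files for a list of GenBank accessions
# (efetch is part of NCBI Entrez Direct; an assumed dependency here):
while read -r acc; do
    efetch -db nuccore -id "$acc" -format fasta > "genomes/${acc}.fasta"
done < staph_accessions.txt

# Download and unpack the Pfam-A HMM database from the EBI FTP site:
wget https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
```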
The pipeline utilises the following directory structure:

```
your_pipeline_location/
├── Snakefile
├── config.yaml
├── input
│   ├── colors.tsv
│   ├── domains.tsv
│   ├── IUPACDNA.txt
│   └── TABLE11.txt
├── src
│   ├── build.sh
│   └── extractorfs.cpp
├── templates
│   ├── style.css
│   └── tmpl.html
├── scripts
│   ├── annot.py
│   ├── archchart.py
│   ├── extractfasta.py
│   ├── extractorfs
│   ├── extractpfm.sh
│   ├── filter.py
│   ├── lib
│   │   └── fasta.py
│   ├── preprocess.py
│   └── unique.py
├── log
└── output
```
In the working directory you can find the `Snakefile` describing the pipeline and `config.yaml`, which provides the path to `Pfam-A.hmm`, the directory with genomic sequences, and the domain architectures of interest. In `input/` there are files describing the nucleotide sequence alphabet and the translation table to be used (`IUPACDNA.txt`, `TABLE11.txt`), as well as the groups of domains and the domains being searched for, together with the colors these groups and domains are to be painted with in the visualisation of all architectures found in the searched genomes (`domains.tsv`, `colors.tsv`). In `src/` the source file `extractorfs.cpp` for `scripts/extractorfs` is placed; if `build.sh` is run from that directory, the application will be recompiled and saved as `scripts/extractorfs`. In `templates/` the CSS and HTML templates are located. These are used to generate an HTML visualisation of all domain architectures found in the searched genomes, which is then converted to a PNG file. The necessary scripts and one compiled application are located in `scripts/`. The directories `output/` and `log/` will be created automatically once the pipeline is run. All diagnostic and error messages from the tools and scripts used by the pipeline will be redirected to files in the `log/` directory.
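To make the configuration concrete, the sketches below show what `config.yaml` and `input/domains.tsv` might look like. These are illustrative assumptions only: the YAML key names, all `domains.tsv` columns other than `pfam_acc` and `group` (the two mentioned in this README), and the accession version numbers are hypothetical. Consult the files shipped with the pipeline for the actual layout.

```
# Hypothetical sketch of config.yaml; key names are assumptions,
# not the pipeline's actual schema.
pfam_hmm: /path/to/Pfam-A.hmm    # path to the Pfam-A HMM database
genomes_dir: genomes/            # directory with genomic sequences
architectures:                   # domain architectures of interest
  - CAT
  - CWT-CAT
```

```
# Hypothetical layout of input/domains.tsv; only the pfam_acc and
# group columns are documented here, the rest is illustrative.
domain      pfam_acc      group
Amidase_2   PF01510.27    CAT
SH3_5       PF08460.12    CWT
```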
The pipeline described in the `Snakefile` encompasses the following stages:

- extractorfs -- using `scripts/extractorfs`, the provided DNA alphabet (`input/IUPACDNA.txt`) and translation table (`input/TABLE11.txt`), extract from each genome all possible open reading frames (ORFs) of length >= 300 nt.
- uniquetrans -- using `scripts/unique.py`, quickly cluster the extracted protein sequences based on their 100% identity to prepare a non-redundant set for HMM searches.
- hmmfetch -- using the `hmmfetch` tool, fetch from the `Pfam-A.hmm` file the domains that are listed in `input/domains.tsv`, whose Pfam accession version numbers are provided in the `pfam_acc` column of that file.
- hmmsearch -- using the `hmmsearch` tool, search for the retrieved domains in the non-redundant protein sequence set (a stand-alone sketch of the HMMER and filtering commands follows this list).
- preprocess -- using `scripts/preprocess.py`, preprocess the raw `hmmsearch` results and save the relevant columns.
- archchart -- using `scripts/archchart.py`, branch to generate charts in HTML format that describe all domain architectures found, also with regard to groups of domains (column `group` in `input/domains.tsv`).
- convertchart -- using the `wkhtmltoimage` tool, continue the branch to convert the HTML charts to PNG.
- filter -- using `scripts/filter.py`, continue the main branch and select, from the preprocessed hits, those with an independent E-value (i-Evalue) <= 0.001 and a domain coverage >= 80% (0.8). Next, from the target protein sequences, select those with a domain architecture of interest (`config.yaml`).
- extracttrans -- using `scripts/extractfasta.py`, extract the finally selected sequences from the non-redundant set.
- signalp -- using the independently installed SignalP 5.0 tool, available as `signalp`, detect N-terminal signal sequences in the final set of protein sequences for each domain architecture of interest. If the tool is not available, the step generates an empty result file and is skipped.
- annotdom -- using `scripts/annot.py`, prepare a GFF3 file with annotations for the finally selected protein sequences. The annotations are prepared based on the filtered `hmmsearch` results and the `signalp` results.
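For orientation, the HMMER and filtering stages correspond roughly to the stand-alone commands below. This is a minimal illustration, not the pipeline's actual code: the file names are hypothetical, and computing coverage as the alignment span over the HMM model length (from the `hmmsearch --domtblout` columns) is an assumption about how `scripts/filter.py` works.

```
# Hypothetical stand-alone illustration of the hmmfetch/hmmsearch/filter
# stages; the real pipeline drives these through the Snakefile.

# Index the database, then fetch the HMMs listed (one accession per line):
hmmfetch --index Pfam-A.hmm
hmmfetch -f Pfam-A.hmm pfam_accessions.txt > selected.hmm

# Search the non-redundant protein set, saving per-domain hits:
hmmsearch --domtblout hits.domtbl selected.hmm nonredundant.faa > /dev/null

# Keep hits with i-Evalue <= 0.001 and >= 80% coverage of the HMM model
# (domtblout columns: 6 = qlen, 13 = i-Evalue, 16/17 = hmm from/to):
awk '!/^#/ && $13 <= 0.001 && ($17 - $16 + 1) / $6 >= 0.8' hits.domtbl > hits.filtered
```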
A more detailed description of how the pipeline works can be found in the comments in the `Snakefile`, the `config.yaml` file, and the script/source files.
Provided that you have the `pfam-gen` environment properly set up and activated, and at least the test data downloaded, you can first look up the list of tasks to be done by running the following command from the pipeline directory, where the `Snakefile` is located:

```
snakemake --dryrun --quiet
```

and then run the pipeline using as many cores as you wish:

```
snakemake --cores number_of_cores
```
The final data, i.e. the non-redundant protein sequences with corresponding annotations for each domain architecture of interest, as well as the HTML and PNG visualisations of all domain architectures found, are placed in the `output/final` directory.