This pipeline is designed by functional annotation team of group 1, to fully annotate biological functions of genes/proteins predicted from the results of gene prediction team with ab initio and homology-based tools.
Except for EggNOG, other functional tools take less than 15 mins, so we are going to run all 50 genomes with each tool.
For EggNOG, we used CD-HIT to cluster the .faa files, and run EGGNOG on the centroid file. From the EggNOG result, we
map to 50 genomes based on the cluster file generated by CD_HIT
After having all results annotated by different tools for each genomes, we merged all gffs into one gff file for each genome.
1. python3 and python2 to run EggNOG annotation
2. perl for TMHMM2.0 tool. The installed tool has perl path /usr/bin/perl.
If the perl path is different from your local computer, please change it to /usr/bin/perl or make a symbolic link.
3. These tools are needed to be included in the current working directory with the FA_pipeline_final.py script, as in this git repo:
- tmhmm-2.0c/bin/tmhmm
- pilercr1.06/pilercr
- signalp-5.0/bin/signalp
- ncbi-blast-2.8.1+/bin containing blasts, blasts, makeblastdb
eggnog-mapper
folder containing emapper.py.
4. These databases are also needed to be present in the current working directory, to run annotation for DOOR2 and VFDB:
- operonDB.fasta
- VFDBdb.fasta
card-data
folder: a database to runcard
annotation
5. Utilities
folder, containing scripts to run with the FA_pipeline_final.py script
6. cd-hit
tool is installed or added to executable path. cd-hit
can be installed with conda using the command:
conda install -c bioconda cd-hit
orconda install -c bioconda/label/cf201901 cd-hit
Or user can download the source code ofcd-hit
following the instruction in this website: https://github.com/weizhongli/cdhit/wiki/2.-Installation
7. rgi
tool needs to be installed before running the pipeline. Users can install rgi
source code at this website: https://github.com/arpcard/rgi. Or rgi
can be installed using conda
. For this pipeline, rgi
version is 4.2.2. The command to install rgi
is:
- conda install -c conda-forge filetype
- conda install -c conda-forge pyahocorasick
- conda install -c conda-forge -c bioconda rgi=4.2.2
- conda install -c biocore blast-plus
git clone https://github.gatech.edu/compgenomics2019/Team1-FunctionalAnnotation.git
cd Team1-FunctionalAnnotation
./FA_pipeline_final.py -p <path for .faa files> -n <path for .fna files> -a <path for .fasta files> -oa<output name for annotation results> -og< output name for gff files> -om <output name for final results>
OR
python3 FA_pipeline_final.py -p <path for .faa files> -n <path for .fna files> -a <path for .fasta files> -oa<output name for annotation results> -og< output name for gff files> -om <output name for final results>
-p
: directory that contains all the .faa files with known gene or RNA sequences from gene prediction
-a
: directory that contains all the .faa files with known protein sequences from gene prediction
-n
:a directory that contains all the .fasta files from assembly genome
-oa
: name of folder that contains annotation results
-og
: name of folder that contains converted gff results
-om
: name of folder that contains 50 off merged file
- There will be a total of 3 output folders after executing the pipeline command. In addition to the final output specified by
-om
argument, there will be another two intermediate folders ,specified by-oa
and-og
arguments. The final output is a .gff file associated with each genome - In addition to the 3 output folders, there is another folder named
cluster-CDHIT
. This folder contains the centroid file, and another file describing cluster genes in each centroid. - When blast path is executed, it will also generate two database folders:
blast_operon_db
: database for DOOR2blast_vfdb_db
: database for vfdb