-
Notifications
You must be signed in to change notification settings - Fork 3
3. Gene prediction and annotation
One of the main tasks in metagenomics is to find which potential functions are present in our samples. There are different approaches to achieve this goal, for example the most straightforward one we can do, is to perform a translated similarity search (blastx) against a protein database using the reads. However, this approach has some caveats: for example, our results will be biased to the content of the database; doing this search using all reads against a large database like the NCBI protein non-redundant (nr) is computationally really expensive; or we are losing the higher sensitivity and specificity power of using profile based methods for searching protein domain databases like PFAM. A good approach to avoid all this problems is to do an ab-initio gene prediction with tools designed specifically to deal with fragmented sequences like the ones we will find in short reads or assembled sequences. A good review about different gene prediction tools applied to short-reads and the particularities of this prediction can be found here
For our reads we will use [FragGeneScan (FGS)] (https://academic.oup.com/nar/article/38/20/e191/1317565/FragGeneScan-predicting-genes-in-short-and-error). FragGeneScan is used by MG-RAST and the EBI MG-Portal. Analysing OSD102 will take roughly 32 minutes using only one core in our BioLinux virtual machine. Nevertheless, we will learn using FragGeneScan with a smaller file which we will create.
First we need to transform the merged sequences from fastq to fasta to be able to be process the sequences using FragGeneScan. This is an easy task with one of the scripts present in BBtools:
reformat.sh in=OSD102.qc.fastq out=OSD102.qc.fasta
Next, we will extract 1000 reads and we will do the gene prediction:
mkdir FGS-test
cd FGS-test
reformat.sh in=OSD102.qc.fasta out=OSD102.qc.1000.fasta reads=1000
Now we can run FragGeneScan on these 1000 merged reads in fasta format:
run_FragGeneScan.pl -genome OSD102.qc.1000.fasta -out OSD102.1000 -complete 0 -train illumina_5
In case we would like to run the gene prediction on the whole dataset we should do:
run_FragGeneScan.pl -genome OSD102.qc.fasta -out OSD102 -complete 0 -train illumina_5
We can easily get some statistics on the predicted sequences using one tiny utility from Sean Eddy in SQUID
seqstat OSD102.faa
seqstat OSD102.ffn
We can count sequences easily using grep :
grep -c '>' OSD102.faa
How many genes does FGS predict? What is the mean length of the nucleotide sequences? and of the amino acid sequences?
Now, that we identified the potential genes in our metagenomic samples, we can start to look for the functions they encode. Depending on the question(s) we have, we can use a more targeted approach. For example focusing on one single gene or taxonomic group, or we can do the annotation of the whole metagenomic sample. Depending on our computational resources, we can do it using our own facilities or we can use any of the public resources like MG-RAST or EBI Metagenomics Portal. Usually these resources use a variety of biological databases, which we can use too - in a local manner - depending on what we want to ask our metagenomic samples. Some of those databases are PFAM, KEGG, eggNOG, COG among others.
OSD102 has been already analysed at EBI-MG portal.
On the other hand, instead of fully annotating our metagenome, we can instead focus on a single function or on all functions of one specific organism. For example, we will use protein domains in combination with the HMMER tool to search for potential bacterial ammonia monooxygenase (AmoA) proteins in our metagenome. Additionally, we will look for proteins that might belong to the Prochlorococcus genus; we extracted all amino acids from the NCBI non-redundant database corresponding to the taxid 1218. We will use the blastx implemented in DIAMOND, a fast alternative to BLAST (up to 20,000X faster), although it has a lower sensitivity compared to BLAST.
We will start searching the AmoA domain in our predicted amino acid sequences. First we will download the HMM profile from the Pfam website:
wget http://pfam.xfam.org/family/PF05145/hmm -O AmoA.hmm
then we will use hmmsearch from HMMER to analyse our predicted amino acid sequences.
hmmsearch --cut_ga --domtblout OSD102.txt AmoA.hmm OSD102.faa
The parameters used for hmmer are --cut_ga, that means that HMMER will use the gathering threshold defined by the PFAM curators to decide if the match is significant or not; with --domtblout we specify the output format for HMMER, in this case we want a parseable table format with one row per-domain hit.
Here you can find a nice explanation about the differences using hmmscan or hmmsearch.
How many potential AmoA sequences are in our metagenomic sample? We easily can get the number of lines in the file using a combination of grep and wc.
Next, we will use DIAMOND to query our Prochlorococcus database using blastp and the amino acid sequences as query sequences. We would be able to use the quality trimmed reads and blastx to achieve similar results. Before using DIAMOND we need to create a database using the Prochlorococcus fasta file:
diamond makedb --in ~/../prochlo.fasta -d prochlo
The DIAMOND database, prochlo.dmnd
will be created in this folder. Now we are ready to search the database using our amino acid sequences as query:
diamond blastp -e 1e-5 -d prochlo.dmnd -q ~/../OSD102.faa -a OSD102.prochlo
We set -e 1e-5 to report the hits with an e-value lower than 1e-25 . The output from the search (prochlo.daa
) is in a DIAMOND alignment archive format. We can transform the file in a more human accessible format like the BLAST tabular format:
diamond view -a OSD102.prochlo.daa -o OSD102.prochlo.txt
The columns for the new file are:
1. query
2. subject
3. % id
4. alignment length
5. mistmatches
6. gap openings
7. query start
8. query end
9. subject start
10. subject end
11. e-value
12. bit score
From the results, we can get the best hit for each of our sequences using a bit of awk magic:
awk '!a[$1]++' OSD102.prochlo.txt > OSD102.prochlo.bh.txt
Or we can be more strict and filter the best hits to have a lower e-value, let's say 1e-25:
awk '$11 <= 1e-25' OSD102.prochlo.bh.txt > OSD102.prochlo.bh.1e-25.txt
How many hits to Prochlorococcus proteins do we have? How many best hits? And how many have a lower e-value than 1e-25?