Skip to content

3. Gene prediction and annotation

Frank Oliver Glöckner edited this page Mar 8, 2017 · 9 revisions

Table of contents

  1. Gene Prediction
  2. Gene annotation

Gene prediction

One of the main tasks in metagenomics is to find which potential functions are present in our samples. There are different approaches to achieve this goal, for example the most straightforward one we can do, is to perform a translated similarity search (blastx) against a protein database using the reads. However, this approach has some caveats: for example, our results will be biased to the content of the database; doing this search using all reads against a large database like the NCBI protein non-redundant (nr) is computationally really expensive; or we are losing the higher sensitivity and specificity power of using profile based methods for searching protein domain databases like PFAM. A good approach to avoid all this problems is to do an ab-initio gene prediction with tools designed specifically to deal with fragmented sequences like the ones we will find in short reads or assembled sequences. A good review about different gene prediction tools applied to short-reads and the particularities of this prediction can be found here

For our reads we will use [FragGeneScan (FGS)] (https://academic.oup.com/nar/article/38/20/e191/1317565/FragGeneScan-predicting-genes-in-short-and-error). FragGeneScan is used by MG-RAST and the EBI MG-Portal. Analysing OSD102 will take roughly 32 minutes using only one core in our BioLinux virtual machine. Nevertheless, we will learn using FragGeneScan with a smaller file which we will create.

First we need to transform the merged sequences from fastq to fasta to be able to be process the sequences using FragGeneScan. This is an easy task with one of the scripts present in BBtools:

reformat.sh in=OSD102.qc.fastq out=OSD102.qc.fasta

Next, we will extract 1000 reads and we will do the gene prediction:

mkdir FGS-test

cd FGS-test

reformat.sh in=OSD102.qc.fasta out=OSD102.qc.1000.fasta reads=1000

Now we can run FragGeneScan on these 1000 merged reads in fasta format:

run_FragGeneScan.pl -genome OSD102.qc.1000.fasta -out OSD102.1000 -complete 0 -train illumina_5

In case we would like to run the gene prediction on the whole dataset we should do:

run_FragGeneScan.pl -genome OSD102.qc.fasta -out OSD102 -complete 0 -train illumina_5

We can easily get some statistics on the predicted sequences using one tiny utility from Sean Eddy in SQUID

seqstat OSD102.faa

seqstat OSD102.ffn

We can count sequences easily using grep :

grep -c '>' OSD102.faa

How many genes does FGS predict? What is the mean length of the nucleotide sequences? and of the amino acid sequences?

Gene annotation

Now, that we identified the potential genes in our metagenomic samples, we can start to look for the functions they encode. Depending on the question(s) we have, we can use a more targeted approach. For example focusing on one single gene or taxonomic group, or we can do the annotation of the whole metagenomic sample. Depending on our computational resources, we can do it using our own facilities or we can use any of the public resources like MG-RAST or EBI Metagenomics Portal. Usually these resources use a variety of biological databases, which we can use too - in a local manner - depending on what we want to ask our metagenomic samples. Some of those databases are PFAM, KEGG, eggNOG, COG among others.

OSD102 has been already analysed at EBI-MG portal.

On the other hand, instead of fully annotating our metagenome, we can instead focus on a single function or on all functions of one specific organism. For example, we will use protein domains in combination with the HMMER tool to search for potential bacterial ammonia monooxygenase (AmoA) proteins in our metagenome. Additionally, we will look for proteins that might belong to the Prochlorococcus genus; we extracted all amino acids from the NCBI non-redundant database corresponding to the taxid 1218. We will use the blastx implemented in DIAMOND, a fast alternative to BLAST (up to 20,000X faster), although it has a lower sensitivity compared to BLAST.

We will start searching the AmoA domain in our predicted amino acid sequences. First we will download the HMM profile from the Pfam website:

wget http://pfam.xfam.org/family/PF05145/hmm -O AmoA.hmm

then we will use hmmsearch from HMMER to analyse our predicted amino acid sequences.

hmmsearch --cut_ga --domtblout OSD102.txt AmoA.hmm OSD102.faa

The parameters used for hmmer are --cut_ga, that means that HMMER will use the gathering threshold defined by the PFAM curators to decide if the match is significant or not; with --domtblout we specify the output format for HMMER, in this case we want a parseable table format with one row per-domain hit.

Here you can find a nice explanation about the differences using hmmscan or hmmsearch.

How many potential AmoA sequences are in our metagenomic sample? We easily can get the number of lines in the file using a combination of grep and wc.

Next, we will use DIAMOND to query our Prochlorococcus database using blastp and the amino acid sequences as query sequences. We would be able to use the quality trimmed reads and blastx to achieve similar results. Before using DIAMOND we need to create a database using the Prochlorococcus fasta file:

diamond makedb --in ~/../prochlo.fasta -d prochlo

The DIAMOND database, prochlo.dmnd will be created in this folder. Now we are ready to search the database using our amino acid sequences as query:

diamond blastp -e 1e-5 -d prochlo.dmnd -q ~/../OSD102.faa -a OSD102.prochlo

We set -e 1e-5 to report the hits with an e-value lower than 1e-25 . The output from the search (prochlo.daa) is in a DIAMOND alignment archive format. We can transform the file in a more human accessible format like the BLAST tabular format:

diamond view -a OSD102.prochlo.daa -o OSD102.prochlo.txt

The columns for the new file are:

 1. query
 2. subject
 3. % id
 4. alignment length
 5. mistmatches
 6. gap openings
 7. query start
 8. query end
 9. subject start
10. subject end
11. e-value
12. bit score

From the results, we can get the best hit for each of our sequences using a bit of awk magic:

awk '!a[$1]++' OSD102.prochlo.txt > OSD102.prochlo.bh.txt

Or we can be more strict and filter the best hits to have a lower e-value, let's say 1e-25:

awk '$11 <= 1e-25' OSD102.prochlo.bh.txt > OSD102.prochlo.bh.1e-25.txt

How many hits to Prochlorococcus proteins do we have? How many best hits? And how many have a lower e-value than 1e-25?