Skip to content

04. de novo assembly

Gustavo A. Silva-Arias edited this page Aug 25, 2021 · 5 revisions

Some external resources

Install software requirements using conda

We will use conda to install the software requirements for this practice. You can find a good tutorial for conda environments management here. To create the environment with all needed requirements use these commands:

source /opt/conda/etc/profile.d/conda.sh
conda create -n py39 python=3.9

After some minutes, you will see a long list of programs to be installed in the new environment and a confirmation message:

Proceed ([y]/n)? y

Then, hit enter.

Once the installation process is done you will see the following information in the terminal

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate py39 
#
# To deactivate an active environment, use
#
#     $ conda deactivate

Now you can proceed to activate the created environment that contains the software ready to use:

cursobioinf01@admin:~$ conda activate py39 

If successful you should see the name of the environment in parenthesis at the beginning of the prompt line.

(py39) cursobioinf01@admin:~$ 

1. Chloroplast assembly

Aim: Assemble a chloroplast genome using short Illumina reads from whole-genome re-sequencing data.

1.1. Check the data

ls -l /home/curso/data/denovo_cp

You should see the following files:

all_plastome_genes.fasta
matK.fasta
ndhF.fasta
sample402_R1.fq.gz
sample402_R2.fq.gz
sample692_R1.fq.gz
sample692_R2.fq.gz
sample893_R1.fq.gz
sample893_R2.fq.gz

Create a new folder in your $HOME to store output files:

mkdir denovo1
cd denovo1

How was the data prepared: These read pairs come from total genomic libraries (WGS) of some plant species. An initial de novo assembly was performed on entire read sets using NOVOPlasty. To simplify the assembly for the practice, we mapped back the original WGS reads to this initial assembly and retained only the reads that mapped. In this way, we obtained a subset of reads that (hopefully) come only from the chloroplast.

1.2. Perform the de novo assembly

For this practice, instead of providing you with a ready-to-use command, you will have to construct your own spades.py command. You can check the multiple options in the website for the Spades manual or just by typing in the terminal:

spades.py

You must set a pair-end assembly, only execute the assembler (i.e. skip the error correction step), limit the number of processors to 4, and set the k-mer sizes manually.

Here you can find the most important options to construct your command to run spades:

-o <output_dir>             directory to store all the resulting files (required)
-1 <filename>               file with forward paired-end reads
-2 <filename>               file with reverse paired-end reads
--only-assembler            runs only assembling (without read error correction)
-k <int> [<int> ...]        list of k-mer sizes (must be odd and less than 128)
                            (e.g. 27,77,117,127)
-t <int>, --threads <int>   number of threads. [default: 16]

Optional task Try several k-mer sizes combinations and save each assembly in a different output folder. You may choose a single sample or work with the three samples. Since the assembly itself takes very little time you can try different options.

1.3. Visualize de assemblies

After your assemblies are finished, download the entire de novo folder to your local computer to visualize the results using Bandage. Download the Bandage and open it on your computer.

Load the .gfa file you got in the output directory:

And click the Draw graph button:

You can visualize the assembled contigs (color bars) and connections between contigs or nodes (thin black lines). In the example above, you can find a big assembled graph (two contig-bubbles linked by two linear contigs. The remaining small contigs in the bottom can be considered noisy fragments (not part of the chloroplast genome) and be discarded (select the nodes and press shift+del to completely remove them from the graph).

Now, we still have few small non-resolved portions of the assembly. First, you need to interpret the obtained assembly graph considering the expected structure of the chloroplast genome (see figure below). Chloroplast genomes (cpDNA) are circular and relatively conserved among land plants in terms of size, structure, and gene content. Most chloroplast genomes contain two inverted repeats (IRa and IRb) that divide the circular genome into a large single copy (LSC) region and a short single copy (SSC) region.

Chloroplast genome model. Source: Wikipedia

Then, you can select the low-coverage and unconnected nodes from the graphs (see example below), since those are more likely to result from erroneously retained reads similar to some fragments of the chloroplast.

After you removed the 'contamination' you can merge the nodes (Edit -> Merge all possible nodes) and obtain a resolved chloroplast genome. If you click again on Draw graph, you should obtain a graph similar to the figure below. You can verify the inverted repeat region has twice the depth of the single-copy regions.

If you want to resolve the full circular genome, you can proceed to duplicate the inverted repeat and select the proper paths using some known genes. Now, select the inverted repeat and duplicate the node (Edit -> Duplicate selected nodes). You can click and drag the nodes to see better the connections.

Now, using the built-in BLAST tool in Bandage you can use two chloroplast genes that indicate the proper direction of the genome. In this case, we can use the genes _mat_K and ndh_F, which are two proximal genes separated by the IRb. Click on the Create/view BLAST search. In the new window, click on the Build BLAST database -> Load from FASTA file and select the two provided fasta files in the denovo folder (ndhF.fasta and matK.fasta). Then, click on Run BLAST search and close the window.

Now, since we identified the positions of the matK gene (green) and the ndhF gene (blue), we can retain the paths that keep the two genes separated by one of the IRs. Then, we proceed to select the edges and exclude them with shift+del to completely remove them from the graph.

then, we have a fully resolved chloroplast genome

If you click on Draw graph again, you can get a nice circular assembly.


Here you can find youtube videos with an introduction to Bandage and basic controls produced by Ryan Wick the author of Bandage.


2. Basic transcriptome assembly

Aim: Compare two approaches for transcriptome assembly using Illumina reads

2.1 Check de data

The data for this practice include clean short read fastq files saved in the folder /home/curso/data/denovo_RNA.

Please do not copy the files. You can input the files without copying them by adding the full path.

2.2. Perform the de novo assembly with MEGAHIT

MEGAHIT is a fast and memory-efficient assembler. Although it is optimized for metagenome assemblly, it also works well on single genome assembly. The minimum requirements for MEGAHIT are the sequence reads provided with the options -1 and -2 and the comma-separated list of kmer size. In MEGAHIT, instead of the kmer size list, you can also provide the minimum and maximum kmer sizes (must be odd numbers) and an increment of kmer size of each iteration (must be an even number). See the example below:

Before running the command, two new bash tricks we can learn here:

  1. In bash the character \ means scape. Therefore, the five-line megahit command below will be interpreted as a single command in the terminal.
  2. DATADIR in the first command becomes an environment variable that stores the full (absolute) path of the directory containing the fastq files. We can use this variable to reduce the length of the commands, making a script easier to read. Every time we use $ next to the variable name, it will be interpreted as the string stored on it (the folder path in this case).
DATADIR=/home/curso/data/denovo_RNA/02_clean

megahit -1 $DATADIR/SRR2014835_R1.fastq.gz \
-2 $DATADIR/SRR2014835_R2.fastq.gz \
-o output_megahit \
--k-min 27 --k-max 127 --k-step 20 \
-t 2 -m 0.4

The MEGAHIT process with the configuration proposed in the example should take around 30-60 minutes to run. MEGAHIT outputs one main FASTA file named final.contigs.fa saved among other files and folders in the folder indicated with the -o option (output_megahit in the example above).

2.3. Perform the de novo assembly with rnaSPAdes

rnaSPAdes is a part of the SPAdes package. rnaSPAdes is a tool for de novo transcriptome assembly from RNA-Seq data (mainly Illumina RNA-Seq data). It also supports hybrid assembly from short and long reads (e.g. PacBio or Oxford Nanopore RNA reads).

The command line is very similar to the MEGAHIT. Following the manual, by default rnaSPAdes uses only 2 k-mer sizes, which are automatically detected using read length (approximately one-third and half of the maximal read length). They recommend not to change this parameter (especially smaller k-mer sizes) to avoid chimeric (misassembled) transcripts.

DATADIR=/home/curso/data/denovo_RNA/02_clean

spades.py --rna \
-1 $DATADIR/SRR2014835_R1.fastq.gz \
-2 $DATADIR/SRR2014835_R2.fastq.gz \
-o output_spades

This process could take between 1-2 hours!

rnaSPAdes outputs one main FASTA file named transcripts.fasta. In addition, you can find transcripts with different levels of filtration:

  • hard_filtered_transcripts.fasta – includes only long and reliable transcripts with high expression.
  • soft_filtered_transcripts.fasta – includes short and low-expressed transcipts, likely to contain junk sequences.

In this practice, we will evaluate the transcripts.fasta file.

2.4. Assembly evaluation/comparison

We now will proceed to compare the output assemblies obtained with MEGAHIT and rnaSPAdes. For this purpose, we can use stats.sh.

2.4.1. Run assembly evaluations

Now, we can proceed to run the processes for the assembly evaluations. We need to run one process for each assembled transcriptome.

For the MEGAHIT output:

stats.sh output_megahit/final.contigs.fa
A       C       G       T       N       IUPAC   Other   GC      GC_stdev
0.2918  0.2121  0.2093  0.2867  0.0000  0.0000  0.0000  0.4215  0.0591

Main genome scaffold total:             48862
Main genome contig total:               48862
Main genome scaffold sequence total:    54.487 MB
Main genome contig sequence total:      54.487 MB       0.000% gap
Main genome scaffold N/L50:             11137/1.582 KB
Main genome contig N/L50:               11137/1.582 KB
Main genome scaffold N/L90:             34660/496
Main genome contig N/L90:               34660/496
Max scaffold length:                    15.785 KB
Max contig length:                      15.785 KB
Number of scaffolds > 50 KB:            0
% main genome in scaffolds > 50 KB:     0.00%


Minimum         Number          Number          Total           Total           Scaffold
Scaffold        of              of              Scaffold        Contig          Contig
Length          Scaffolds       Contigs         Length          Length          Coverage
--------        --------------  --------------  --------------  --------------  --------
    All                 48,862          48,862      54,487,352      54,487,352   100.00%
    100                 48,862          48,862      54,487,352      54,487,352   100.00%
    250                 48,803          48,803      54,473,999      54,473,999   100.00%
    500                 34,487          34,487      48,958,003      48,958,003   100.00%
   1 KB                 20,240          20,240      38,782,673      38,782,673   100.00%
 2.5 KB                  3,769           3,769      12,796,513      12,796,513   100.00%
   5 KB                    242             242       1,486,697       1,486,697   100.00%
  10 KB                      3               3          38,387          38,387   100.00%

For the rnaSPAdes output:

stats.sh output_spades/transcripts.fasta
A       C       G       T       N       IUPAC   Other   GC      GC_stdev
0.2890  0.2080  0.2096  0.2934  0.0001  0.0000  0.0000  0.4176  0.0672

Main genome scaffold total:             87205
Main genome contig total:               87433
Main genome scaffold sequence total:    77.786 MB
Main genome contig sequence total:      77.776 MB       0.013% gap
Main genome scaffold N/L50:             15490/1.623 KB
Main genome contig N/L50:               15506/1.621 KB
Main genome scaffold N/L90:             53855/318
Main genome contig N/L90:               53931/318
Max scaffold length:                    15.792 KB
Max contig length:                      15.792 KB
Number of scaffolds > 50 KB:            0
% main genome in scaffolds > 50 KB:     0.00%


Minimum         Number          Number          Total           Total           Scaffold
Scaffold        of              of              Scaffold        Contig          Contig
Length          Scaffolds       Contigs         Length          Length          Coverage
--------        --------------  --------------  --------------  --------------  --------
    All                 87,205          87,433      77,786,055      77,775,785    99.99%
     50                 87,204          87,432      77,786,006      77,775,736    99.99%
    100                 87,155          87,383      77,782,893      77,772,623    99.99%
    250                 66,432          66,636      73,543,443      73,533,419    99.99%
    500                 42,138          42,315      65,429,285      65,419,566    99.99%
   1 KB                 28,047          28,194      55,140,005      55,131,273    99.98%
 2.5 KB                  5,711           5,764      19,464,127      19,460,245    99.98%
   5 KB                    384             393       2,359,054       2,358,154    99.96%
  10 KB                      6               6          72,027          72,027   100.00%

You can now make a table summarizing the results of the evaluation of the MEGAHIT and rnaSPAdes assemblies, compare and interpret the results.

Alternatively, you can use the software rnaQUAST. It provides informative reports using reference genome and gene databases. It allows performing de novo evaluation using several third-party tools, such as BUSCO and GeneMarkS-T.