-
Notifications
You must be signed in to change notification settings - Fork 2
04. de novo assembly
Some external resources
- Tools and example commands for a de novo assembly pipeline
- Good and concise lecture in youtube
We will use conda to install the software requirements for this practice. You can find a good tutorial for conda environments management here. To create the environment with all needed requirements use these commands:
source /opt/conda/etc/profile.d/conda.sh
conda create -n py39 python=3.9
After some minutes, you will see a long list of programs to be installed in the new environment and a confirmation message:
Proceed ([y]/n)? y
Then, hit enter.
Once the installation process is done you will see the following information in the terminal
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
# $ conda activate py39
#
# To deactivate an active environment, use
#
# $ conda deactivate
Now you can proceed to activate the created environment that contains the software ready to use:
cursobioinf01@admin:~$ conda activate py39
If successful you should see the name of the environment in parenthesis at the beginning of the prompt line.
(py39) cursobioinf01@admin:~$
Aim: Assemble a chloroplast genome using short Illumina reads from whole-genome re-sequencing data.
ls -l /home/curso/data/denovo_cp
You should see the following files:
all_plastome_genes.fasta
matK.fasta
ndhF.fasta
sample402_R1.fq.gz
sample402_R2.fq.gz
sample692_R1.fq.gz
sample692_R2.fq.gz
sample893_R1.fq.gz
sample893_R2.fq.gz
Create a new folder in your $HOME to store output files:
mkdir denovo1
cd denovo1
How was the data prepared: These read pairs come from total genomic libraries (WGS) of some plant species. An initial de novo assembly was performed on entire read sets using NOVOPlasty. To simplify the assembly for the practice, we mapped back the original WGS reads to this initial assembly and retained only the reads that mapped. In this way, we obtained a subset of reads that (hopefully) come only from the chloroplast.
For this practice, instead of providing you with a ready-to-use command, you will have to construct your own spades.py
command. You can check the multiple options in the website for the Spades manual or just by typing in the terminal:
spades.py
You must set a pair-end assembly, only execute the assembler (i.e. skip the error correction step), limit the number of processors to 4, and set the k-mer sizes manually.
Here you can find the most important options to construct your command to run spades:
-o <output_dir> directory to store all the resulting files (required)
-1 <filename> file with forward paired-end reads
-2 <filename> file with reverse paired-end reads
--only-assembler runs only assembling (without read error correction)
-k <int> [<int> ...] list of k-mer sizes (must be odd and less than 128)
(e.g. 27,77,117,127)
-t <int>, --threads <int> number of threads. [default: 16]
Optional task Try several k-mer sizes combinations and save each assembly in a different output folder. You may choose a single sample or work with the three samples. Since the assembly itself takes very little time you can try different options.
After your assemblies are finished, download the entire de novo folder to your local computer to visualize the results using Bandage. Download the Bandage and open it on your computer.
Load the .gfa
file you got in the output directory:
And click the Draw graph
button:
You can visualize the assembled contigs (color bars) and connections between contigs or nodes (thin black lines). In the example above, you can find a big assembled graph (two contig-bubbles linked by two linear contigs. The remaining small contigs in the bottom can be considered noisy fragments (not part of the chloroplast genome) and be discarded (select the nodes and press shift+del
to completely remove them from the graph).
Now, we still have few small non-resolved portions of the assembly. First, you need to interpret the obtained assembly graph considering the expected structure of the chloroplast genome (see figure below). Chloroplast genomes (cpDNA) are circular and relatively conserved among land plants in terms of size, structure, and gene content. Most chloroplast genomes contain two inverted repeats (IRa and IRb) that divide the circular genome into a large single copy (LSC) region and a short single copy (SSC) region.
Chloroplast genome model. Source: Wikipedia
Then, you can select the low-coverage and unconnected nodes from the graphs (see example below), since those are more likely to result from erroneously retained reads similar to some fragments of the chloroplast.
After you removed the 'contamination' you can merge the nodes (Edit -> Merge all possible nodes) and obtain a resolved chloroplast genome. If you click again on Draw graph
, you should obtain a graph similar to the figure below. You can verify the inverted repeat region has twice the depth of the single-copy regions.
If you want to resolve the full circular genome, you can proceed to duplicate the inverted repeat and select the proper paths using some known genes. Now, select the inverted repeat and duplicate the node (Edit -> Duplicate selected nodes). You can click and drag the nodes to see better the connections.
Now, using the built-in BLAST tool in Bandage you can use two chloroplast genes that indicate the proper direction of the genome. In this case, we can use the genes _mat_K and ndh_F, which are two proximal genes separated by the IRb. Click on the Create/view BLAST search. In the new window, click on the Build BLAST database -> Load from FASTA file and select the two provided fasta files in the denovo folder (ndhF.fasta and matK.fasta). Then, click on Run BLAST search and close the window.
Now, since we identified the positions of the matK gene (green) and the ndhF gene (blue), we can retain the paths that keep the two genes separated by one of the IRs. Then, we proceed to select the edges and exclude them with shift+del
to completely remove them from the graph.
then, we have a fully resolved chloroplast genome
If you click on Draw graph
again, you can get a nice circular assembly.
Here you can find youtube videos with an introduction to Bandage and basic controls produced by Ryan Wick the author of Bandage
.
Aim: Compare two approaches for transcriptome assembly using Illumina reads
The data for this practice include clean short read fastq files saved in the folder
/home/curso/data/denovo_RNA
.
Please do not copy the files. You can input the files without copying them by adding the full path.
MEGAHIT is a fast and memory-efficient assembler. Although it is optimized for metagenome assemblly, it also works well on single genome assembly. The minimum requirements for MEGAHIT are the sequence reads provided with the options -1
and -2
and the comma-separated list of kmer size. In MEGAHIT, instead of the kmer size list, you can also provide the minimum and maximum kmer sizes (must be odd numbers) and an increment of kmer size of each iteration (must be an even number). See the example below:
Before running the command, two new bash tricks we can learn here:
- In
bash
the character\
means scape. Therefore, the five-line megahit command below will be interpreted as a single command in the terminal.
- DATADIR in the first command becomes an environment variable that stores the full (absolute) path of the directory containing the fastq files. We can use this variable to reduce the length of the commands, making a script easier to read. Every time we use $ next to the variable name, it will be interpreted as the string stored on it (the folder path in this case).
DATADIR=/home/curso/data/denovo_RNA/02_clean
megahit -1 $DATADIR/SRR2014835_R1.fastq.gz \
-2 $DATADIR/SRR2014835_R2.fastq.gz \
-o output_megahit \
--k-min 27 --k-max 127 --k-step 20 \
-t 2 -m 0.4
The MEGAHIT process with the configuration proposed in the example should take around 30-60 minutes to run. MEGAHIT outputs one main FASTA file named final.contigs.fa
saved among other files and folders in the folder indicated with the -o
option (output_megahit
in the example above).
rnaSPAdes
is a part of the SPAdes
package.
rnaSPAdes is a tool for de novo transcriptome assembly from RNA-Seq data (mainly Illumina RNA-Seq data). It also supports hybrid assembly from short and long reads (e.g. PacBio or Oxford Nanopore RNA reads).
The command line is very similar to the MEGAHIT. Following the manual, by default rnaSPAdes uses only 2 k-mer sizes, which are automatically detected using read length (approximately one-third and half of the maximal read length). They recommend not to change this parameter (especially smaller k-mer sizes) to avoid chimeric (misassembled) transcripts.
DATADIR=/home/curso/data/denovo_RNA/02_clean
spades.py --rna \
-1 $DATADIR/SRR2014835_R1.fastq.gz \
-2 $DATADIR/SRR2014835_R2.fastq.gz \
-o output_spades
This process could take between 1-2 hours!
rnaSPAdes
outputs one main FASTA file named transcripts.fasta
. In addition, you can find transcripts with different levels of filtration:
- hard_filtered_transcripts.fasta – includes only long and reliable transcripts with high expression.
- soft_filtered_transcripts.fasta – includes short and low-expressed transcipts, likely to contain junk sequences.
In this practice, we will evaluate the transcripts.fasta
file.
We now will proceed to compare the output assemblies obtained with MEGAHIT and rnaSPAdes. For this purpose, we can use stats.sh
.
Now, we can proceed to run the processes for the assembly evaluations. We need to run one process for each assembled transcriptome.
For the MEGAHIT output:
stats.sh output_megahit/final.contigs.fa
A C G T N IUPAC Other GC GC_stdev
0.2918 0.2121 0.2093 0.2867 0.0000 0.0000 0.0000 0.4215 0.0591
Main genome scaffold total: 48862
Main genome contig total: 48862
Main genome scaffold sequence total: 54.487 MB
Main genome contig sequence total: 54.487 MB 0.000% gap
Main genome scaffold N/L50: 11137/1.582 KB
Main genome contig N/L50: 11137/1.582 KB
Main genome scaffold N/L90: 34660/496
Main genome contig N/L90: 34660/496
Max scaffold length: 15.785 KB
Max contig length: 15.785 KB
Number of scaffolds > 50 KB: 0
% main genome in scaffolds > 50 KB: 0.00%
Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- -------------- -------------- -------------- -------------- --------
All 48,862 48,862 54,487,352 54,487,352 100.00%
100 48,862 48,862 54,487,352 54,487,352 100.00%
250 48,803 48,803 54,473,999 54,473,999 100.00%
500 34,487 34,487 48,958,003 48,958,003 100.00%
1 KB 20,240 20,240 38,782,673 38,782,673 100.00%
2.5 KB 3,769 3,769 12,796,513 12,796,513 100.00%
5 KB 242 242 1,486,697 1,486,697 100.00%
10 KB 3 3 38,387 38,387 100.00%
For the rnaSPAdes output:
stats.sh output_spades/transcripts.fasta
A C G T N IUPAC Other GC GC_stdev
0.2890 0.2080 0.2096 0.2934 0.0001 0.0000 0.0000 0.4176 0.0672
Main genome scaffold total: 87205
Main genome contig total: 87433
Main genome scaffold sequence total: 77.786 MB
Main genome contig sequence total: 77.776 MB 0.013% gap
Main genome scaffold N/L50: 15490/1.623 KB
Main genome contig N/L50: 15506/1.621 KB
Main genome scaffold N/L90: 53855/318
Main genome contig N/L90: 53931/318
Max scaffold length: 15.792 KB
Max contig length: 15.792 KB
Number of scaffolds > 50 KB: 0
% main genome in scaffolds > 50 KB: 0.00%
Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- -------------- -------------- -------------- -------------- --------
All 87,205 87,433 77,786,055 77,775,785 99.99%
50 87,204 87,432 77,786,006 77,775,736 99.99%
100 87,155 87,383 77,782,893 77,772,623 99.99%
250 66,432 66,636 73,543,443 73,533,419 99.99%
500 42,138 42,315 65,429,285 65,419,566 99.99%
1 KB 28,047 28,194 55,140,005 55,131,273 99.98%
2.5 KB 5,711 5,764 19,464,127 19,460,245 99.98%
5 KB 384 393 2,359,054 2,358,154 99.96%
10 KB 6 6 72,027 72,027 100.00%
You can now make a table summarizing the results of the evaluation of the MEGAHIT and rnaSPAdes assemblies, compare and interpret the results.
Alternatively, you can use the software rnaQUAST. It provides informative reports using reference genome and gene databases. It allows performing de novo evaluation using several third-party tools, such as BUSCO and GeneMarkS-T.