Trainers: John Juma, Ouso Daniel & Gilbert Kibet
- Introduction
- Scope
- Background
- Prerequisite
- Set-Up
- Preparations
- Analysis
- Summarize results
- Download reports
- Data Retrieval and Review
- Working with metadata
- Galaxy workflows for SARS-CoV-2 data analysis
- Activity
In early January 2020, the novel coronavirus (SARS-CoV-2) responsible for a pneumonia outbreak in Wuhan, China, was identified using next-generation sequencing (NGS) and readily available bioinformatics pipelines. In addition to virus discovery, these NGS technologies and bioinformatics resources are currently being employed for ongoing genomic surveillance of SARS-CoV-2 worldwide, tracking its spread, evolution and patterns of variation on a global scale.
In this short workshop we will tackle, hands-on, the basic principles employed by numerous bioinformatics pipelines to generate consensus genome sequences of SARS-CoV-2 and identify variants, using an actual dataset generated in our facility.
Note
This is part of an initiative fronted by the Africa CDC, with generous support from the Rockefeller Foundation, to build capacity in pathogen genomics in Africa.
We will use a dataset comprising raw sequence reads of SARS-CoV-2 samples obtained from a sequencing run on the NextSeq 550 platform at ILRI. The NextSeq 550 flowcell uses 4 lanes, so 4 reads of data per sequenced sample, corresponding to the 4 lanes, are generated with the suffixes L001, L002, L003 and L004. Such reads can be combined/concatenated into a single file, bearing in mind the type of library sequencing: either single- or paired-end. The dataset we are using in this tutorial comprises already concatenated sequences:
- https://hpc.ilri.cgiar.org/~douso/AfricaCDC_training/sars1_R1.fastq.gz
- https://hpc.ilri.cgiar.org/~douso/AfricaCDC_training/sars1_R2.fastq.gz
- https://hpc.ilri.cgiar.org/~douso/AfricaCDC_training/sars2_R1.fastq.gz
- https://hpc.ilri.cgiar.org/~douso/AfricaCDC_training/sars2_R2.fastq.gz
- https://hpc.ilri.cgiar.org/~douso/AfricaCDC_training/sars3_R1.fastq.gz
- https://hpc.ilri.cgiar.org/~douso/AfricaCDC_training/sars3_R2.fastq.gz
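Concatenation is a simple `cat` of the per-lane files, done separately for R1 and R2 in a paired-end library. A minimal sketch, assuming hypothetical per-lane file names:

```
# Hypothetical lane-split files for one paired-end sample; gzip files
# can be concatenated directly without decompressing first
cat sampleX_L001_R1.fastq.gz sampleX_L002_R1.fastq.gz \
    sampleX_L003_R1.fastq.gz sampleX_L004_R1.fastq.gz > sampleX_R1.fastq.gz
cat sampleX_L001_R2.fastq.gz sampleX_L002_R2.fastq.gz \
    sampleX_L003_R2.fastq.gz sampleX_L004_R2.fastq.gz > sampleX_R2.fastq.gz
```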
This module will come after the introductory Linux module and therefore assumes familiarity with basic Linux command-line use. It also assumes you have an account and are operating in the ILRI computing cluster from a local Linux environment.
Note
Once inside the `hpc`, all instances of `$USER` will be equivalent to the HPC username that you were assigned, for example `Bio4Info$$`. Your username, by default, is stored in a variable called `USER`. By using it, you will not have to type in your username; rather, your shell will automatically pick your username, which is the value stored in the `USER` variable. The `$` (dollar) character prefix to a variable name is used to call the value of that variable.
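For example, once logged in you can inspect the variable yourself:

```
echo $USER                 # prints your assigned username
echo /var/scratch/$USER    # the shell substitutes the value into the path
```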
We will use the computer lab at ILRI, which is already equipped with Linux-operating desktop computers. Since we will be working from the remote servers, we will not need special setup for personal laptops. However, toward the end of the program, we can look into access to a Linux server from a Windows PC; or how to install a Linux (sub)system for any interested persons.
From the terminal (or equivalent tool) of your local computer, you can log into the HPC using the following command line, followed by pressing Enter. You will be prompted to type in your password (the password will not be visible as you type it; just have faith). On a Linux system, you can use the Ctrl-Alt-T keyboard shortcut to open a terminal.

```
ssh <user_name>@hpc.ilri.cgiar.org
```
The HPC head node has 4 CPUs, and we need to access more CPUs/resources in other compute nodes. You will have to move from the cluster's head node into the node we will be working from (it is called `compute05`). Use the following command; `-w` requests a specific list of host(s).

```
interactive -w compute05
```

`ssh` allows you to securely connect to the remote computer over the internet, while `interactive` allows you to reserve resources to work interactively in a specified node within the computing cluster, using the `-w` flag.
Note: When running a job interactively, the time limit is 8 hours and the default number of CPUs is 1.
- We then change into the `compute05` `scratch` directory to create our project directory. Using the option `-p` (parent), `mkdir` will create any missing intermediate directories.

```
cd /var/scratch/
mkdir -p $USER/AfricaCDC_training
cd $USER/AfricaCDC_training
```
- The `assets`, `databases`, `primer-schemes` and `scripts` directories will be linked to the project directory, to limit redundancy. `-s` (soft) means that we are creating a soft link.

```
ln -s /var/scratch/global/AfricaCDC_training/[adps]* .
```
- We will create the directories `data`, `results` and `genome` to store raw data in `fastq` format, outputs and reference genomes, respectively. Intermediate output files, per tool/software, will be created within the `results` directory. We will exploit bash brace expansion to create all the directories at once.

```
mkdir data genome results
mkdir -p results/{fastqc,fastp,kraken,samtools,ivar,snpeff,pangolin,nextclade,multiqc,bowtie2,bedtools}
```
- Change into the `data` directory, from where we will retrieve our `fastq` files.

```
cd data
ls
```
- While there are specialised tools for data retrieval from nucleotide sequence databases, the universal Unix command `wget` can be used to download data over the internet.

```
wget --no-check-certificate https://hpc.ilri.cgiar.org/~douso/AfricaCDC_training/sars-fastqs.tar.gz
```
- After downloading your data, say from a sequencing facility site, it is good practice to verify that your data was not intentionally/accidentally tampered with. To do this, your data service provider will likely accompany your data with a file containing a verification code: a `checksum_file` (will be provided). The `md5sum` command, with the `-c` (check) flag, allows checking the integrity of a file downloaded or acquired from a different source.

```
wget --no-check-certificate https://hpc.ilri.cgiar.org/~douso/AfricaCDC_training/sars-fastqs.md5
ls
md5sum -c sars-fastqs.md5
```
- Next, we will unpack the archive using `tar` with the `-xf` (extract, file; respectively) flags, which tell `tar` to extract the given file.

```
tar -xf sars-fastqs.tar.gz
ls
```
- Download the SARS-CoV-2 reference genome and the genome annotation file.
We will retrieve the SARS-CoV-2 reference genome and its annotation from NCBI.
- On a web browser, open the link NCBI.
- Type 'SARS-CoV-2' in the search box and select the 'Genome' database.
- Select the GenBank hyperlink.
- Select the genome version GCA_009858895.3_ASM985889v3.
- Right click on the genome FASTA and select 'copy link'.
- Change into the `genome` directory using the command `cd ../genome`.
- Use `wget` to fetch the genome FASTA file.
- Retrieve the feature annotation file (GFF) using the `wget` command.
- Download the md5checksums file using the `wget` command (a sketch of these three fetches follows this list), then check the integrity of your reference genome (FASTA) and annotation (GFF) files.

```
echo "$(grep GCA_009858895.3_ASM985889v3_genomic.fna.gz md5checksums.txt | cut -f1 -d' ') GCA_009858895.3_ASM985889v3_genomic.fna.gz" | md5sum -c -
echo "$(grep GCA_009858895.3_ASM985889v3_genomic.gff.gz md5checksums.txt | cut -f1 -d' ') GCA_009858895.3_ASM985889v3_genomic.gff.gz" | md5sum -c -
```
- If the integrity check of the files has passed (`OK`), uncompress the `.gz` files.

```
gunzip *.gz
```
- Rename the `FASTA` and `GFF` files.

```
mv GCA_009858895.3_ASM985889v3_genomic.fna nCoV-2019.fasta
mv GCA_009858895.3_ASM985889v3_genomic.gff nCoV-2019.gff
```
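As a sketch, the three `wget` fetches referenced in the list above would look like the following. The URLs here assume NCBI's FTP directory layout for this assembly, so prefer the actual links you copied from the browser:

```
cd ../genome
# Genome FASTA, annotation (GFF) and checksums for GCA_009858895.3_ASM985889v3
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/858/895/GCA_009858895.3_ASM985889v3/GCA_009858895.3_ASM985889v3_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/858/895/GCA_009858895.3_ASM985889v3/GCA_009858895.3_ASM985889v3_genomic.gff.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/858/895/GCA_009858895.3_ASM985889v3/md5checksums.txt
```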
- Clear the environment.

```
module purge
```
- Load modules using the `module load <tool-name>` command.

```
module load fastqc/0.11.7
module load fastp/0.22.0
module load kraken/2.0.8-beta
module load bowtie2/2.3.4.1
module load samtools/1.11
module load ivar/1.3.1
module load bedtools/2.29.0
module load R/4.2
module load bcftools/1.11
module load snpeff/4.1g
module load multiqc/1.12
module load nextclade/1.11.0
module load python/3.9
```
Optional: The above modules can also be loaded using a single command.

```
module load fastqc/0.11.7 fastp/0.22.0 \
  kraken/2.0.8-beta bowtie2/2.3.4.1 samtools/1.11 ivar/1.3.1 \
  bedtools/2.29.0 R/4.2 bcftools/1.11 snpeff/4.1g multiqc/1.12 \
  nextclade/1.11.0 python/3.9
```
- To list the loaded modules, type the command below.

```
module list
```
- While still in the `genome` directory, we will index the reference sequence using samtools' `faidx`. Indexing produces a `.fai` file consisting of five tab-separated columns: chromosome name, sequence length, first-base offset, sequence line width without `\n` (the newline character) and sequence line width with `\n`. This is essential for samtools' operations.

```
samtools faidx nCoV-2019.fasta
```

The above command generates the index for the reference genome, named `nCoV-2019.fasta.fai`.
- We can take a sneak view of the generated file and manipulate it for fun, say, to extract the genome size of the reference fasta. This can be extracted from the `faidx`-indexed genome file using the `cut` command. `-f` specifies the field(s) of interest.

```
cut -f 1,2 nCoV-2019.fasta.fai > nCoV-2019.fasta.sizes
```
- In order to allow easy access of genome regions during read mapping, we will index the reference genome using the `bowtie2-build` command.

```
mkdir /var/scratch/$USER/AfricaCDC_training/genome/bowtie2

bowtie2-build \
  --threads 1 \
  /var/scratch/$USER/AfricaCDC_training/genome/nCoV-2019.fasta \
  /var/scratch/$USER/AfricaCDC_training/genome/bowtie2/nCoV-2019
```

The above command generates index files with the suffix `.bt2` for the reference genome, with the prefix `nCoV-2019`.
- Build the SnpEff database for the reference genome.
SnpEff, a variant annotation and effect predictor, needs a database to perform genomic annotations. There are pre-built databases for thousands of genomes, so chances are that your organism of choice already has a SnpEff database available.
Note: We will use a pre-built SARS-CoV-2 SnpEff database.
Optional: In the (unlikely?) event that you need to build one yourself, you can do so using the commands found here.
`FastQC` is a common tool for Illumina read quality checks. The basic statistics from this report include `total sequences`, `sequence length` and `%GC`. Another 10 measures of quality are also graphically represented. Your experimental design will be critical in interpreting `FastQC` reports. This step is very important for the subsequent data processing, especially at initial optimisation steps.
- Change into the results `fastqc` directory.

```
cd /var/scratch/$USER/AfricaCDC_training/results/fastqc/
```

- Run `fastqc`.

```
fastqc \
  -t 1 \
  -o . \
  /var/scratch/$USER/AfricaCDC_training/data/sars1_R1.fastq.gz \
  /var/scratch/$USER/AfricaCDC_training/data/sars1_R2.fastq.gz
```

Optional: Run the above step for the other 2 samples.
The preceding step will guide us on the filtering and trimming operations to subject our data to. Depending on your study design, it is important to minimise noise as much as possible, ideally to zero, although that may be practically impossible.
- Change into the output `fastp` directory.

```
cd /var/scratch/$USER/AfricaCDC_training/results/fastp/
```

- Run `fastp`. `-i,-I` (inputs) are for read1 and read2, respectively; `-o,-O` (outputs) are for the corresponding trimmed read1 and read2. The `2>` construct redirects the standard error channel for saving as a log file.

```
fastp \
  -w 1 \
  -i /var/scratch/$USER/AfricaCDC_training/data/sars1_R1.fastq.gz \
  -I /var/scratch/$USER/AfricaCDC_training/data/sars1_R2.fastq.gz \
  -o sars1_R1.trim.fastq.gz \
  -O sars1_R2.trim.fastq.gz \
  -h sars1.fastp.html \
  -j sars1.fastp.json \
  2> sars1.fastp.log
```
Optional: Run the two steps above for the other 2 samples.
At times, sequencing experiments will pick up non-target nucleic acids: for instance, host genetic material in SARS-CoV-2 sequencing. These may obscure our signal of interest in the data; therefore, it is important to minimise or remove such sources of noise (unwanted background).
There are several bioinformatics tools and databases which can be used to query the read data in order to remove such noise. A commonly used tool is Kraken2.
Kraken2 is a fast and memory-efficient tool for taxonomic assignment of metagenomics sequencing reads. `Kraken2` allows us to query the composition of our samples by searching sequence reads against a pre-formatted ("contaminant") database.
Note: Preformatted Kraken 2 and Bracken indexes can be found here: https://benlangmead.github.io/aws-indexes/k2 and can be downloaded without the need to build new ones from scratch.
In this tutorial, we will use a pre-formatted Kraken2 `human` database to identify human-derived reads in our samples. This may give us an indication of contamination from host reads.
Quiz: What type of contaminants would you think of in a SARS-CoV-2 sequencing experiment?

Answer: Host DNA, host RNA and internal control (PhiX)

- Human database search
- Change into the `kraken` results directory.

```
cd /var/scratch/$USER/AfricaCDC_training/results/kraken
```

- Run `kraken2`.

```
kraken2 \
  --db /var/scratch/$USER/AfricaCDC_training/databases/kraken2-human-db \
  --threads 1 \
  --unclassified-out sars1.unclassified#.fastq \
  --classified-out sars1.classified#.fastq \
  --report sars1.kraken2.report.txt \
  --output sars1.kraken2.out \
  --gzip-compressed \
  --report-zero-counts \
  --paired /var/scratch/$USER/AfricaCDC_training/results/fastp/sars1_R1.trim.fastq.gz \
  /var/scratch/$USER/AfricaCDC_training/results/fastp/sars1_R2.trim.fastq.gz
```
Quiz: How many reads have hit the human genome as targets in the sample(s)?

Answer: sars1 - 60529 reads (13.00%)

Quiz: What percentage of the sequencing reads are classified as non-human?

More information on output formats can be found here.
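Since we searched against a human-only database, the unclassified fraction approximates the non-human reads. A minimal sketch to pull that figure out of the report (kraken2 reports are tab-separated, with the percentage in column 1 and the rank code in column 4, where `U` means unclassified):

```
awk -F'\t' '$4 == "U" {print $1 "% of reads were not classified as human"}' sars1.kraken2.report.txt
```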
Optional: Run the steps above for the other 2 samples.

Quiz: Based on the results, which sample do you think will give us good results in terms of genome coverage?
Aligning sequence reads to a reference genome is the first step in many comparative genomics pipelines, including pipelines for variant calling, isoform quantitation and differential gene expression. In many cases, the alignment step is the slowest. This is because for each read the aligner must solve a difficult computational problem: determining the read's likely point of origin with respect to the reference genome. This is non-trivial for several reasons:
- The reference genome is often very big. Searching big things is harder than searching small things.
- You aren't always looking for exact matches in the reference genome; or, at least, probably not.
Here we use Bowtie2, an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences.
- Change to the `bowtie2` directory.

```
cd /var/scratch/$USER/AfricaCDC_training/results/bowtie2
```

- Run the `bowtie2` command to align reads to the reference genome.

```
bowtie2 \
  -x /var/scratch/$USER/AfricaCDC_training/genome/bowtie2/nCoV-2019 \
  -1 /var/scratch/$USER/AfricaCDC_training/results/kraken/sars1.unclassified_1.fastq \
  -2 /var/scratch/$USER/AfricaCDC_training/results/kraken/sars1.unclassified_2.fastq \
  --threads 1 \
  --un-conc-gz sars1.unmapped.fastq.gz \
  --local \
  --very-sensitive-local \
  2> sars1.bowtie2.log \
  | samtools view -@ 1 -F4 -bhS -o sars1.trim.dec.bam -
```
Optional: Run the two steps above for the other 2 samples.
Alignments can often be manipulated using `samtools` and its several subcommands.
- Sort the converted binary alignment (`.bam`).

```
samtools sort -@ 1 -o sars1.sorted.bam -T sars1 sars1.trim.dec.bam
```

- Index the sorted alignment.

```
samtools index -@ 1 sars1.sorted.bam
```
Optional: Run the two steps above for the other 2 samples.
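As a quick sanity check of an alignment, the `flagstat` subcommand summarises how many reads mapped, paired and so on:

```
samtools flagstat -@ 1 sars1.sorted.bam   # counts of total, mapped and properly paired reads
```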
Trim amplicon primers using iVar.
iVar uses primer positions supplied in a BED file to soft-clip primer sequences from an aligned and sorted BAM file. Following this, the reads are trimmed based on a quality threshold (default: 20). To do the quality trimming, iVar uses a sliding-window approach (default window width: 4). The window slides from the 5' end to the 3' end and, if at any point the average base quality in the window falls below the threshold, the remainder of the read is soft-clipped. If, after trimming, the length of the read is greater than the minimum length specified (default: 30), the read is written to the new trimmed BAM file.
- Change to the output directory `ivar`.

```
cd /var/scratch/$USER/AfricaCDC_training/results/ivar/
```

- Run the command to trim primers.

```
ivar trim \
  -i /var/scratch/$USER/AfricaCDC_training/results/bowtie2/sars1.sorted.bam \
  -b /var/scratch/$USER/AfricaCDC_training/primer-schemes/V3/nCoV-2019.primer.bed \
  -p sars1.primertrimmed \
  -m 30 \
  -q 20 > sars1.ivar.log
```

- Sort the primer-trimmed alignment.

```
samtools sort \
  -@ 1 \
  -o sars1.primertrimmed.sorted.bam \
  -T sars1 sars1.primertrimmed.bam
```

- Index the sorted primer-trimmed alignment.

```
samtools index -@ 1 sars1.primertrimmed.sorted.bam
```
Here we will use bedtools, your swiss-army knife for genomic arithmetic and interval manipulation.
- Change to the output directory `bedtools`.

```
cd /var/scratch/$USER/AfricaCDC_training/results/bedtools/
```

- Compute coverage.

```
bedtools \
  genomecov \
  -d \
  -ibam \
  /var/scratch/$USER/AfricaCDC_training/results/ivar/sars1.primertrimmed.sorted.bam \
  > sars1.coverage
```

- Plot to visualize.

```
Rscript /var/scratch/$USER/AfricaCDC_training/scripts/plotGenomecov.R sars1.coverage
```
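Besides the plot, quick summary numbers can be pulled from the per-position table with `awk`; the columns produced by `bedtools genomecov -d` are chromosome, position and depth:

```
# Mean depth across the genome
awk '{sum += $3} END {printf "Mean depth: %.2fx\n", sum/NR}' sars1.coverage
# Fraction of positions at or above the 10x threshold used later for consensus calling
awk '$3 >= 10 {covered++} END {printf "Breadth at 10x: %.2f%%\n", 100*covered/NR}' sars1.coverage
```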
`iVar` uses the output of the `samtools mpileup` command to call variants: single nucleotide variants (SNVs) and indels.
Pileup format consists of TAB-separated lines, with each line representing the pileup of reads at a single genomic position.
Several columns contain numeric quality values encoded as individual ASCII characters. Each character can range from "!" to "~" and is decoded by taking its ASCII value and subtracting 33; e.g., "A" encodes the numeric value 32.
The first three columns give the position and reference:
- Chromosome name.
- 1-based position on the chromosome.
- Reference base at this position (this will be "N" on all lines if `-f` or `--fasta-ref` has not been used).
In generating the mpileup, we will use the flags:
- `--count-orphans`: do not skip anomalous read pairs in variant calling. Anomalous read pairs are those marked in the FLAG field as paired in sequencing but without the properly-paired flag set.
- `--ignore-overlaps`: disable read-pair overlap detection.
- `--no-BAQ`: disable base alignment quality (BAQ) computation.
The `tee` command, used with a pipe, reads standard input from `samtools mpileup`, then writes it to standard output and simultaneously copies it into the specified `.mpileup` file.
In order to call variants correctly, the reference file used for alignment must be passed to `iVar` using the `-r` flag. The output of samtools mpileup is piped into `ivar variants` to generate a `.tsv` file with the variants. There are two parameters that can be set for variant calling with `iVar`: minimum quality (default: 20) and minimum frequency (default: 0.03). Minimum quality is the minimum quality for a base to be counted towards the ungapped depth used to calculate iSNV frequency at a given position. For insertions, the quality metric is discarded and the mpileup depth is used directly. Minimum frequency is the minimum frequency required for an SNV or indel to be reported.
`iVar` can identify codons and translate variants into amino acids using a GFF file containing the required coding regions (CDS). In the absence of a GFF file, iVar will not perform the translation, and "NA" will be added to the output file in place of the reference and alternate codons and amino acids.
- Change to the output directory `ivar`.

```
cd /var/scratch/$USER/AfricaCDC_training/results/ivar/
```

- Call variants.

```
samtools mpileup \
  --ignore-overlaps \
  --count-orphans \
  --no-BAQ \
  --max-depth 0 \
  --min-BQ 0 \
  --reference /var/scratch/$USER/AfricaCDC_training/genome/nCoV-2019.fasta \
  /var/scratch/$USER/AfricaCDC_training/results/ivar/sars1.primertrimmed.sorted.bam \
  | tee sars1.mpileup \
  | ivar \
    variants \
    -t 0.25 \
    -q 20 \
    -m 10 \
    -g /var/scratch/$USER/AfricaCDC_training/genome/nCoV-2019.gff \
    -r /var/scratch/$USER/AfricaCDC_training/genome/nCoV-2019.fasta \
    -p sars1.variants
```
- Convert the variants from `.tsv` to `.vcf` (Variant Call Format).

```
python /var/scratch/$USER/AfricaCDC_training/scripts/ivar_variants_to_vcf.py \
  sars1.variants.tsv \
  sars1.vcf \
  --pass_only \
  --allele_freq_thresh 0.75 > sars1.variant.counts.log
```
The header begins the file and provides metadata describing the body of the file. Header lines are denoted as starting with `#`. Special keywords in the header are denoted with `##`. Recommended keywords include fileformat, fileDate and reference. The header contains keywords that optionally describe, semantically and syntactically, the fields used in the body of the file, notably INFO, FILTER and FORMAT.
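A quick peek at these header lines in the `sars1.vcf` we generated above:

```
grep "^##" sars1.vcf | head -n 5   # metadata lines (fileformat, INFO/FILTER/FORMAT definitions)
grep "^#CHROM" sars1.vcf           # the single column-header line preceding the records
```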
The body of a VCF consists of the following tab-separated columns:

| # | Name | Brief description (see the specification for details) |
|---|------|--------------------------------------------------------|
| 1 | CHROM | The name of the sequence (typically a chromosome) on which the variation is being called. |
| 2 | POS | The 1-based position of the variation on the given sequence. |
| 3 | ID | The identifier of the variation, e.g. a dbSNP rs identifier, or if unknown a ".". Multiple identifiers should be separated by semi-colons without white-space. |
| 4 | REF | The reference base (or bases in the case of an indel) at the given position on the given reference sequence. |
| 5 | ALT | The list of alternative alleles at this position. |
| 6 | QUAL | A quality score associated with the inference of the given alleles. |
| 7 | FILTER | A flag indicating which of a given set of filters the variation has failed, or PASS if all the filters were passed successfully. |
| 8 | INFO | An extensible list of key-value pairs (fields) describing the variation. |
| 9 | FORMAT | An (optional) extensible list of fields for describing the samples. |
| + | SAMPLES | For each (optional) sample described in the file, values are given for the fields listed in FORMAT. |

- Compress the VCF file.

```
bgzip -c sars1.vcf > sars1.vcf.gz
```
- Create a tabix index from the sorted, bgzip-compressed, tab-delimited file.

```
tabix -p vcf -f sars1.vcf.gz
```

- Generate stats from the VCF file.

```
bcftools stats sars1.vcf.gz > sars1.stats.txt
```
We will use SnpEff. It annotates and predicts the effects of genetic variants on genes and proteins (such as amino acid changes). It requires a configured SnpEff database with the annotation or features of the genome.
- Change to the output directory `snpeff`.

```
cd /var/scratch/$USER/AfricaCDC_training/results/snpeff/
```

- Annotate and predict variants.

```
java -Xmx4g -jar /export/apps/snpeff/4.1g/snpEff.jar \
  nCoV-2019 \
  -c /var/scratch/$USER/AfricaCDC_training/databases/snpeff_db/snpeff.config \
  -dataDir /var/scratch/$USER/AfricaCDC_training/databases/snpeff_db/data \
  /var/scratch/$USER/AfricaCDC_training/results/ivar/sars1.vcf.gz \
  > sars1.ann.vcf
```
- Compress the VCF file.

```
bgzip -c sars1.ann.vcf > sars1.ann.vcf.gz
```

- Rename the `summary.html` and `genes.txt` files.

```
mv snpEff_summary.html sars1.summary.html
mv snpEff_genes.txt sars1.genes.txt
```

- Create a tabix index from the sorted, bgzip-compressed file.

```
tabix -p vcf -f sars1.ann.vcf.gz
```

- Generate stats from the VCF file.

```
bcftools stats sars1.ann.vcf.gz > sars1.stats.txt
```
- Filter variants. SnpSift annotates genomic variants using databases, and filters and manipulates annotated variants. Once you have annotated your files using SnpEff, you can use SnpSift to help you filter large genomic datasets in order to find the most significant variants for your experiment.

```
java -Xmx4g -jar /export/apps/snpeff/4.1g/SnpSift.jar \
  extractFields \
  -s "," \
  -e "." \
  sars1.ann.vcf.gz \
  CHROM POS REF ALT \
  "ANN[*].GENE" "ANN[*].GENEID" \
  "ANN[*].IMPACT" "ANN[*].EFFECT" \
  "ANN[*].FEATURE" "ANN[*].FEATUREID" \
  "ANN[*].BIOTYPE" "ANN[*].RANK" "ANN[*].HGVS_C" \
  "ANN[*].HGVS_P" "ANN[*].CDNA_POS" "ANN[*].CDNA_LEN" \
  "ANN[*].CDS_POS" "ANN[*].CDS_LEN" "ANN[*].AA_POS" \
  "ANN[*].AA_LEN" "ANN[*].DISTANCE" "EFF[*].EFFECT" \
  "EFF[*].FUNCLASS" "EFF[*].CODON" "EFF[*].AA" "EFF[*].AA_LEN" \
  > sars1.snpsift.txt
```
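The extracted table is tab-delimited; one way to eyeball it with aligned columns:

```
column -t -s $'\t' sars1.snpsift.txt | less -S   # -S prevents line wrapping
```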
To generate a consensus sequence, iVar uses the output of the samtools mpileup command. The mpileup output must be piped into `ivar consensus`. There are five parameters that can be set:
- minimum quality `-q` (default: 20)
- minimum frequency threshold `-t` (default: 0)
- minimum depth to call a consensus `-m` (default: 10)
- a flag `-k` to exclude regions with depth less than the minimum depth from the consensus
- a character `-n` to call in regions with coverage lower than the specified minimum depth (default: 'N')
Minimum quality is the minimum quality of a base to be considered in calculations of variant frequencies at a given position. Minimum frequency threshold is the minimum frequency that a base must match to be called as the consensus base at a position. If one base is not enough to match a given frequency, then an ambiguous nucleotide is called at that position. Minimum depth is the minimum required depth to call a consensus. If the `-k` flag is set, then these regions are not included in the consensus sequence; if `-k` is not set, then by default an 'N' is called in these regions. You can also specify which character you want to add to the consensus to cover regions with depth less than the minimum depth. This can be done using the `-n` option, which takes one of two values: `-` or `N`.
- Change to the output directory `ivar`.

```
cd /var/scratch/$USER/AfricaCDC_training/results/ivar/
```

- Generate the pileup and consensus genome sequence.

```
samtools \
  mpileup \
  --reference /var/scratch/$USER/AfricaCDC_training/genome/nCoV-2019.fasta \
  --count-orphans \
  --no-BAQ \
  --max-depth 0 \
  --min-BQ 0 \
  -aa \
  /var/scratch/$USER/AfricaCDC_training/results/ivar/sars1.primertrimmed.sorted.bam \
  | tee sars1.mpileup \
  | ivar \
    consensus \
    -t 0.75 \
    -q 20 \
    -m 10 \
    -n N \
    -p sars1.cons
```
The `tee` command reads from the standard input and writes to both standard output and one or more files at the same time. `tee` is mostly used in combination with other commands through piping.
- Rename the consensus genome header.

```
sed -i '/^>/s/Consensus_\(.*\)_threshold.*/\1/' sars1.cons.fa
```
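You can confirm the rename worked:

```
grep "^>" sars1.cons.fa   # should now print the trimmed header, e.g. >sars1.cons
```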
Nextclade is a tool within the Nextstrain collection that uses sequence differences to assign sequences to clades. It also reports suspect quality issues with such sequences. There are both web and command-line interfaces for Nextclade. To run it on the command line, we need some reference files: genome, feature map, origin tree, primers and quality configurations. Luckily, for SARS-CoV-2, these can be easily retrieved using the same tool; otherwise, you will have to create/retrieve them accordingly.
- Get the reference dataset.

```
nextclade dataset get --name 'sars-cov-2' --reference 'MN908947' --output-dir /var/scratch/$USER/AfricaCDC_training/nextclade_db
```

- Perform clade assignment.

```
nextclade \
  --input-fasta /var/scratch/$USER/AfricaCDC_training/results/ivar/sars1.cons.fa \
  --input-dataset /var/scratch/$USER/AfricaCDC_training/nextclade_db \
  --input-root-seq /var/scratch/$USER/AfricaCDC_training/nextclade_db/reference.fasta \
  --genes E,M,N,ORF1a,ORF1b,ORF3a,ORF6,ORF7a,ORF7b,ORF8,ORF9b,S \
  --input-gene-map /var/scratch/$USER/AfricaCDC_training/nextclade_db/genemap.gff \
  --input-tree /var/scratch/$USER/AfricaCDC_training/nextclade_db/tree.json \
  --input-qc-config /var/scratch/$USER/AfricaCDC_training/nextclade_db/qc.json \
  --input-pcr-primers /var/scratch/$USER/AfricaCDC_training/nextclade_db/primers.csv \
  --output-csv /var/scratch/$USER/AfricaCDC_training/results/nextclade/sars1.csv \
  --output-tree /var/scratch/$USER/AfricaCDC_training/results/nextclade/sars1.auspice.json \
  --output-dir /var/scratch/$USER/AfricaCDC_training/results/nextclade/ \
  --output-basename sars1.cons \
  2> /var/scratch/$USER/AfricaCDC_training/results/nextclade/sars1.nextclade.log
```
- Visualization: The output of Nextclade includes a phylogenetic tree in `.json` format. This tree can be visualized in Auspice. First, let us copy the `.json` file to the home directory:

```
cp /var/scratch/$USER/AfricaCDC_training/results/nextclade/sars1.auspice.json ~/
```

On your local computer, use `scp` to copy the file to any desired destination:

```
scp <user_name>@hpc.ilri.cgiar.org:/home/<user_name>/*.json <destination_folder>
```

Open Auspice and drag and drop the `.json` file into Auspice. Now edit the tree.
- In `Dataset`, click the drop-down arrow and select `ncov`; below it select `open`, and below it select `global`.
- In `Color By`, click the drop-down arrow and select `clade`.
- Do any other adjustments as you wish.
Phylogenetic Assignment of Named Global Outbreak Lineages (Pangolin) implements a dynamic nomenclature of SARS-CoV-2 lineages, known as the Pango nomenclature. To assign Pango lineages, we will use the web version of pangolin. It also has a robust command-line version that we will look into later. With the web version, we must retrieve our consensus genome from the analysis server (HPC) to our local computer. Follow the steps below to assign your query sequences Pango lineages. Also, here is a tutorial.
- In your browser, open the Pangolin Web Application. This is the online version of Pangolin.
- Now copy-paste/drag-drop your consensus file to the site and click `Start Analysis`.
- Once done, download the results and, if you need to, copy the names of sequences that failed the analysis to a file. The download is called `results.csv`.
Here is how pangolin performs the analysis (figure not shown):
Alternatively, to perform the command-line analysis with Pangolin, let us proceed as follows. We will use a Singularity image (think of a Singularity image as a ready-to-use container within which we have packaged all the software needed to do a certain task), in this case packaging the Pangolin software.
- Let us create a directory to store our image:

```
mkdir /var/scratch/$USER/AfricaCDC_training/singularity
```

- Download the image:

```
singularity pull --dir /var/scratch/$USER/AfricaCDC_training/singularity/ \
  --force docker://staphb/pangolin:latest
```

- Download Pangolin's reference data (downloads/updates to the latest release of pangoLEARN and constellations):

```
singularity run /var/scratch/$USER/AfricaCDC_training/singularity/pangolin_latest.sif \
  pangolin --update-data \
  --datadir /var/scratch/$USER/AfricaCDC_training/pangolin_db
```

- Conduct Pangolin lineage assignment:

```
singularity run /var/scratch/$USER/AfricaCDC_training/singularity/pangolin_latest.sif \
  pangolin /var/scratch/$USER/AfricaCDC_training/results/ivar/sars1.cons.fa \
  --alignment \
  --usher \
  --max-ambig 0.3 \
  --min-length 25000 \
  --outdir /var/scratch/$USER/AfricaCDC_training/results/pangolin/ \
  --outfile sars1.pangolin.usher.csv \
  --datadir /var/scratch/$USER/AfricaCDC_training/pangolin_db/ \
  2> /var/scratch/$USER/AfricaCDC_training/results/pangolin/sars1.pangolin.usher.log
```
Running the nf-core/viralrecon pipeline.
nf-core, "a community effort to collect a curated set of analysis pipelines built using Nextflow", has many pipelines that can be easily set up and used to analyse genomics data. We will be using nf-core/viralrecon for our analysis, the most recent version being 2.4.1.
To set it up, we will follow the workflow in Launch pipeline.
Our data is stored in `/var/scratch/global/ilri_AuCDC/miseq`. We need to:
- SSH into the HPC:

```
ssh <username>@hpc.ilri.cgiar.org
```

- Go into interactive mode in compute05:

```
interactive -c 3 -J nextflow
```

- Create a working directory in the compute node:

```
mkdir -p /var/scratch/$USER/miseq_analysis/
cd /var/scratch/$USER/miseq_analysis/
mkdir results work
```

- Symbolically link our data to a directory in `scratch`:

```
mkdir -p /var/scratch/global/$USER/miseq_analysis/
cd /var/scratch/global/$USER/miseq_analysis/
ln -s /var/scratch/global/ilri_AuCDC/miseq/* ./
```
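The launch form will ask for an input samplesheet. For Illumina data, nf-core/viralrecon expects a CSV with `sample,fastq_1,fastq_2` columns; below is a sketch with illustrative file names (adjust the paths to the actual fastq files linked above):

```
sample,fastq_1,fastq_2
sample1,/var/scratch/global/<username>/miseq_analysis/sample1_S1_L001_R1_001.fastq.gz,/var/scratch/global/<username>/miseq_analysis/sample1_S1_L001_R2_001.fastq.gz
sample2,/var/scratch/global/<username>/miseq_analysis/sample2_S2_L001_R1_001.fastq.gz,/var/scratch/global/<username>/miseq_analysis/sample2_S2_L001_R2_001.fastq.gz
```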
- Now let us go back to Launch pipeline and do the step-by-step setup.
- Finally, let us transfer the parameters JSON file to the HPC.
- We can now launch the analysis as follows:
- Load modules and set some Java options:

```
module load nextflow/21.10
NXF_OPTS='-Xms1g -Xmx4g'
```

- Change into the working directory:

```
cd /var/scratch/$USER/miseq_analysis/
```

- Launch the analysis:

```
nextflow run nf-core/viralrecon -r 2.4.1 \
  -profile singularity \
  -resume \
  -work-dir /var/scratch/$USER/miseq_analysis/work \
  -params-file /var/scratch/global/$USER/miseq_analysis/nf-params.json
```
- Wait for the run to be completed.
- Let us download the MultiQC report and visualize it as follows:
- Check the `results` directory and the `results/multiqc` directory:

```
ls ./results
ls ./results/multiqc
```

- Download the `./results/multiqc/multiqc_report.html` and visualize it. Clue: use the `scp` command.
Aggregate results from bioinformatics analyses across many samples into a single report with MultiQC.
- Change to the output directory `multiqc`.

```
cd /var/scratch/$USER/AfricaCDC_training/results/multiqc/
```

- Aggregate the tools' outputs.

```
multiqc \
  --force \
  --title SARS-CoV-2 \
  --export \
  --outdir . \
  --config /var/scratch/$USER/AfricaCDC_training/assets/multiqc_config.yaml \
  /var/scratch/$USER/AfricaCDC_training/results/
```
- To help with file transfers, we will create a user-specific directory inside the `global` temporary directory of the head node.

```
mkdir /var/scratch/global/AfricaCDC_training_outputs/$USER
```

- Copy output files of interest (multiqc, primer-trimmed bam).

```
cp /var/scratch/$USER/AfricaCDC_training/results/ivar/*.sorted.bam* \
  /var/scratch/global/AfricaCDC_training_outputs/$USER/
cp -r SARS-CoV-2_multiqc_report_plots *.html \
  /var/scratch/global/AfricaCDC_training_outputs/$USER/
```

- On your local computer, open a terminal and create a directory inside the `Downloads` directory.

```
mkdir -p /mnt/c/Downloads/AfricaCDC_training/results
cd /mnt/c/Downloads/AfricaCDC_training/results/
```
- Copy all the contents of the `/var/scratch/global/AfricaCDC_training_outputs/<username>/` directory in the HPC to the local outputs directory you created in the previous step.
Note: Replace `<username>` with the actual provided HPC account username.

```
rsync \
  -avP \
  <username>@hpc.ilri.cgiar.org:/var/scratch/global/AfricaCDC_training_outputs/<username>/* \
  .
```

OR

```
scp <username>@hpc.ilri.cgiar.org:/var/scratch/global/AfricaCDC_training_outputs/<username>/* \
  .
```
Having sequenced our samples on both MiSeq (Illumina) and MinION (ONT), we can now transfer the sequencing output to the HPC where we will conduct the bioinformatics analysis.
The MiSeq runs on Windows, and data transfer will be done by copy-pasting the data to the HPC through the network. Ensure the transfer is completed successfully, without errors.
The MinION sequencer stores its sequencing output on a Linux-based computer. To transfer the data, we logged into the computer and transferred the data on the command line as follows.
```
rsync -avP <path-to-the-directory-with-sequencing-output>/ <HPC-login-username>:<path-to-the-directory-to-store-sequencing-output>/
```

Replace `<path-to-the-directory-with-sequencing-output>` with the path to the directory storing the sequencing output, `<HPC-login-username>` with your HPC login username (i.e. user##@hpc.ilri.cgiar.org) and `<path-to-the-directory-to-store-sequencing-output>` with the path to the directory where you want to store the data on the HPC.

Example:

```
rsync -avP /media/SeqData_LTS/20220405_1121_MN2816_FAH91436_2b8e9827/ <username>@hpc.ilri.cgiar.org:/var/scratch/global/$USER/20220405_1121_MN2816_FAH91436_2b8e9827
```
Change the working directory into the directory that stores the Illumina dataset.

```
cd /var/scratch/global/miseq/
```

Now let us view the sequencing output. The output is in FASTQ format. Note: in the second command, replace `<one-of-the-fastq.gz>` with the name of one of the fastq.gz files.

```
ls fastq/
less -S fastq/<one-of-the-fastq.gz>
```
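A FASTQ record spans four lines, so you can count the reads in any of the files like this (replace the placeholder as before):

```
zcat fastq/<one-of-the-fastq.gz> | awk 'END {print NR/4, "reads"}'
```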
Change the working directory into the directory that stores the ONT data. Note: replace `<name-of-run-folder>` with the name of the run folder.

```
cd /var/scratch/global/ONT/
ls <name-of-run-folder>
```
Metadata is the data associated with your main data; it describes your data in a manner that allows drawing information upon analysing the data. Often, the importance of metadata is ignored: things like how to capture, store, encode and organise metadata, especially with the downstream analyses and interpretation in mind.
IMPORTANT: Data without metadata is, mostly, garbage.
We will take a look at some example metadata to highlight some concerns with metadata, and the reason(s) why they are important in data analysis workflows.
test_lab | case_id | lab_id | loc | age | gender | occup | samp_type | symp | vacc_state | coll_dt | confir_dt | recep_dt |
---|---|---|---|---|---|---|---|---|---|---|---|---|
TESTING_LAB | CASE-ID | SampleNumber | LOCATION | AGE | GENDER (M/F) | OCCUPATION | SAMPLE TYPE | SYMPTOMS SHOWN (COUGH;FEVER;ETC) | VACCINATION STATUS | DATE OF SAMPLE COLLECTION | LAB CONFIRMATION DATE | DATE SAMPLE RECEIVED IN THE LAB |
LABA | place/id/date | COVD0308 | Some Place | 80 | F | Food handler | NP & OP Swab | Asymptomatic | Yes | 5th June 2020 | 08-Jun-20 | 08/06/20 |
LABB | COM/SARS001/2022 | COVD360 | Comoros | 38 | F | None | NP-OP Swab | FC;CO;H | Yes | 5th June 2020 | 08-Jun-20 | 08/06/20 |
LABC | DJI/SARS001/2023 | COVD 273 | Djibouti | 30 | M | Business | NP-OP Swab | Asymptomatic | No | 5th June 2020 | 08-Jun-20 | 08/06/20 |
LABD | SWZ/SARS001/2024 | COVD154 | Eswatini | 23 | Female | Food seller | OP-NP Swap | No symptoms | Yes | 5th June 2020 | 08-Jun-20 | 08/06/20 |
LABE | ETH/SARS001/2025 | COVD0875 | Ethiopia | 34 | M | Targeted testing | NP-OP Swab | Asymptomatic | Yes | 5th June 2020 | 08-Jun-20 | 08/06/20 |
LABF | LBY/SARS001/2027 | COVD00672 | Libya | 18 | Female | Food handler | NP-OP Swab | Fever/Chills, Cough, Headache | Yes | 5th June 2020 | 08-Jun-20 | 08/06/20 |
LABG | MDG/SARS001/2028 | COVD499 | Madagascar | 25 | F | Targeted testing | NP Swab | Asymptomatic | No | 5th June 2020 | 08-Jun-20 | 08/06/20 |
LABH | MUS/SARS001/2029 | COVD078 | Mauritius | 25 | F | Not indicated | NP-OP Swab | Asymptomatic | No | 5th June 2020 | 08-Jun-20 | 08/06/20 |
LABI | SYC/SARS001/2030 | COVD579 | Seychelles | 2 years | Male | Targeted testing | NP Swab | Asymptomatic | No | 5th June 2020 | 08-Jun-20 | 08/06/20 |
LABJ | SOM/SARS001/2031 | 300 | Somalia | 26 | F | Food handler | NP-OP Swab | Asymptomatic | No | 5th June 2020 | 08-Jun-20 | 08/06/20 |
LABK | SSD/SARS001/2031 | COVD00381 | South Sudan | 45 | M | NA | NP-OP Swab | Asymptomatic | Yes | 5th June 2020 | 08-Jun-20 | 08/06/20 |
What are some of the issues you notice with the above metadata?
Answer
- Inconsistent header naming
- Future dates
- Mixed data types within column
- Inconsistent date formats
- Inconsistent sample type capturing
- Inconsistent symptoms
- Spaces in headers
- Use of commas
- ...
Computation-wise, whichever way (poor/good) you choose to organise your data, ensure consistency.
We will demonstrate how to analyze SARS-CoV-2 ampliconic data generated by ARTIC protocols and sequenced on Illumina and Oxford Nanopore machines. For the analysis, we will make use of the tutorial developed by Galaxy core members.
For analysis of ampliconic ARTIC data, we will make use of the following datasets:
https://hpc.ilri.cgiar.org/~douso/AfricaCDC_training/sars1_R1.fastq.gz
https://hpc.ilri.cgiar.org/~douso/AfricaCDC_training/sars1_R2.fastq.gz
https://hpc.ilri.cgiar.org/~douso/AfricaCDC_training/sars2_R1.fastq.gz
https://hpc.ilri.cgiar.org/~douso/AfricaCDC_training/sars2_R2.fastq.gz
https://hpc.ilri.cgiar.org/~douso/AfricaCDC_training/sars3_R1.fastq.gz
https://hpc.ilri.cgiar.org/~douso/AfricaCDC_training/sars3_R2.fastq.gz
First we need to find a good dataset to play with. The Sequence Read Archive (SRA) is the primary archive of unassembled reads operated by the US National Institutes of Health (NIH). SRA is a great place to get the sequencing data that underlie publications and studies. Let’s do that:
- Go to NCBI's SRA page by pointing your browser to https://www.ncbi.nlm.nih.gov/sra
- In the search box, enter PRJNA909758 and click the Search button. (Alternatively, you can simply click on this link.)
- Download metadata describing these datasets by:
  - clicking on the Send to: dropdown
  - selecting File
  - changing Format to RunInfo
  - clicking Create file
- This will create a rather large `SraRunInfo.csv` file in your `Downloads` folder.
Now that we have downloaded this file we can go to a Galaxy instance and start processing it.
We will follow these steps, highlighted in https://training.galaxyproject.org/training-material/topics/variant-analysis/tutorials/sars-cov-2/tutorial.html to get a subset of the data:
- Process and filter the `SraRunInfo.csv` file in Galaxy.
- Here the SRA accessions we will use are: `SRR22561688` and `SRR22561690`.
- Download the sequencing data with `Faster Download and Extract Reads in FASTQ`.
https://hpc.ilri.cgiar.org/~gkibet/SARS-ONT/fastq/barcode85.fastq.gz
https://hpc.ilri.cgiar.org/~gkibet/SARS-ONT/fastq/barcode86.fastq.gz
https://hpc.ilri.cgiar.org/~gkibet/SARS-ONT/fastq/barcode87.fastq.gz
https://hpc.ilri.cgiar.org/~gkibet/SARS-ONT/fastq/barcode88.fastq.gz
https://hpc.ilri.cgiar.org/~gkibet/SARS-ONT/fastq/barcode89.fastq.gz
This activity is to help you practice the analysis of Illumina data using the Galaxy workflow.
You are encouraged to change the parameters, including: sample IDs (names), coverage depth cutoffs, ...
Follow the steps we used in the analysis of the Illumina-sequenced data.
https://hpc.ilri.cgiar.org/~jjuma/SARS-CoV-2-ILRI-WHO-2022/testdata-subsampled/COVM02379_S37_con_R1_001.fastq.gz
https://hpc.ilri.cgiar.org/~jjuma/SARS-CoV-2-ILRI-WHO-2022/testdata-subsampled/COVM02379_S37_con_R2_001.fastq.gz
https://hpc.ilri.cgiar.org/~jjuma/SARS-CoV-2-ILRI-WHO-2022/testdata-subsampled/COVM02720_S365_con_R1_001.fastq.gz
https://hpc.ilri.cgiar.org/~jjuma/SARS-CoV-2-ILRI-WHO-2022/testdata-subsampled/COVM02720_S365_con_R2_001.fastq.gz
The data below comes from a MiSeq sequencer
https://hpc.ilri.cgiar.org/~gkibet/SARS-ILL/fastq/sample1_S2_L001_R1_001.fastq.gz
https://hpc.ilri.cgiar.org/~gkibet/SARS-ILL/fastq/sample1_S2_L001_R2_001.fastq.gz
https://hpc.ilri.cgiar.org/~gkibet/SARS-ILL/fastq/sample2_S1_L001_R1_001.fastq.gz
https://hpc.ilri.cgiar.org/~gkibet/SARS-ILL/fastq/sample2_S1_L001_R2_001.fastq.gz
https://hpc.ilri.cgiar.org/~gkibet/SARS-ILL/fastq/sample3_S3_L001_R1_001.fastq.gz
https://hpc.ilri.cgiar.org/~gkibet/SARS-ILL/fastq/sample3_S3_L001_R2_001.fastq.gz
https://hpc.ilri.cgiar.org/~gkibet/SARS-ILL/fastq/sample4_S6_L001_R1_001.fastq.gz
https://hpc.ilri.cgiar.org/~gkibet/SARS-ILL/fastq/sample4_S6_L001_R2_001.fastq.gz
https://hpc.ilri.cgiar.org/~gkibet/SARS-ILL/fastq/sample5_S8_L001_R1_001.fastq.gz
https://hpc.ilri.cgiar.org/~gkibet/SARS-ILL/fastq/sample5_S8_L001_R2_001.fastq.gz