Firstly, we need to install some common software, and download some data to help us on our way. For the software we will use a program called 'conda':mag:, which will allow us to easily install lots of common bioinformatics software in a special 'environment' (think of it like a box :package:) without the need for admin access or other complications. For the data, we will download it from some public repositories.

NB - If you are attending the Workshop on Genomics then 'conda', the software, and all of the data are already installed for you! So you can continue to the next section called 'Adventure Time' 😃. Although you might like to read the next bit just for reference.

First, we need to enter (or create) a directory called "workshop_materials" in your home directory and then clone this repository. All further commands will be run within the "genomics_adventure" directory.

cd workshop_materials
# or
mkdir workshop_materials && cd workshop_materials

# clone this repository
git clone https://github.com/guyleonard/genomics_adventure.git

# enter genomics_adventure
cd genomics_adventure

Software

This section will create the 'environment' 📦 in which we will be having our adventure, this allows us to keep all the software in one place for easy access and repeatability (e.g. you may wish to run different versions of software for other analyses, you can do that in other environments). We won't explore each of the programs that we will install right now, but the adventure will explain each as we get to them.

🐜 This time you may copy and paste, one-by-one, the commands below:

# Make sure we are up to date
conda update -n base conda

# Create our environment
conda create --name genomics_adventure python=3.7

# Activate our environment
conda activate genomics_adventure

# add required channels
conda config --add channels bioconda
conda config --add channels conda-forge

# Install the software
conda install -c bioconda bcftools=1.12 bedtools blast bwa ea-utils emboss fastqc igv igvtools pfam_scan qualimap quast=5.0.2 samtools=1.12 seqtk spades sra-tools trim-galore vcftools

If conda is being a pain, then you might like to try 'mamba', a drop-in replacement which runs much faster and more smoothly (conda install -c conda-forge mamba).

Make sure that the environment is manually activated every time you come back to this adventure. You should see '(genomics_adventure)' before your normal terminal prompt. If it is not activated, use the 'activate' command from above.
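If you are ever unsure which environment is active, conda exports its name in the `CONDA_DEFAULT_ENV` shell variable. A quick check (the 'none' fallback here is just for illustration when no environment is active):

```shell
# Print the name of the active conda environment, or 'none' if there isn't one
echo "Active environment: ${CONDA_DEFAULT_ENV:-none}"
```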

Data

We will need to retrieve several sets of data for our adventure, this is similar to how you may collate data for your own analyses.

  1. Sequence Data
  • Either directly from a sequencing service or from a public access database.
  2. Reference Data
  • If you are lucky enough to have a reference genome...
  3. Databases
  • Pfam-A

We will be working with the bacterial species Escherichia coli, as it has a relatively small genome (which keeps the timings of this tutorial manageable), but the techniques you will learn here can be applied to smaller or larger genomes, including eukaryotic ones, too!

Sequencing Data

Back at your home institute you will likely retrieve your data from either the institute's sequencing service or a private outside provider. However, there is also a wealth 💰 of sequenced genomic data stored in publicly accessible repositories such as NCBI's SRA or EMBL-EBI's ENA. These portals are also where you will be required to deposit your own sequencing efforts upon publication.

For this adventure we will be downloading and processing raw sequencing data. Please note that some sequencing services may provide trimmed or quality-assessed reads as part of their standard service; it is up to you whether to use that data directly or to process the raw data yourself. Always ask: are their methods directly suited to your analysis needs?

The raw data that we will use for the E. coli genome is available from NCBI or EMBL-EBI with the accession SRR857279 (it is also archived at the DDBJ-DRA). The data is mirrored between the sites; however, each site has a different way of accessing it. We will focus on NCBI and EMBL-EBI for now.

With NCBI you need to use a tool called 'fastq-dump':mag:, which, given an accession and several other options, will download the 'fastq' data files - it can be notoriously tricky at times and has some issues with downloading PacBio data. You can give it a try below if you wish; however, the EMBL-EBI downloads will be much faster for this tutorial, so we strongly suggest you start there.

At EMBL-EBI they provide direct links to the 'fastq' files that were submitted to the archive ("Submitted files (FTP)"), and so you can use tools such as 'wget' or 'curl' to retrieve them.

NB - These commands may take a little bit of time to complete depending on your connection (NCBI: ~15 minutes; EMBL-EBI: ~2 minutes), so you might want to skip ahead to the next chapter for some light reading about sequencing technologies and file formats whilst you wait... don't forget to come back soon!

# make a directory for the data and move there
mkdir -p sequencing_data/ecoli && cd sequencing_data/ecoli

# either
# fastq-dump from NCBI - slow
fastq-dump --split-files --origfmt --gzip SRR857279

# or
# with wget from EMBL-EBI - faster
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR857/SRR857279/SRR857279_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR857/SRR857279/SRR857279_2.fastq.gz

# make the files read-only so we don't destroy our data accidentally
chmod 444 *.gz

# Now do the same for Chapter 5's Pseudomonas data
cd .. && mkdir pseudomonas_gm41 && cd pseudomonas_gm41

# get the Illumina Data
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR491/SRR491287/SRR491287_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR491/SRR491287/SRR491287_2.fastq.gz

# get the PacBio data
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR104/006/SRR1042836/SRR1042836_subreads.fastq.gz

# make the files read-only so we don't destroy our data accidentally
chmod 444 *.gz
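A quick sanity check after downloading paired-end data is that both files of a pair hold the same number of reads (a FASTQ record is always four lines). The snippet below builds a tiny two-read toy pair purely for illustration - in practice, run the same counting commands on the real downloaded files:

```shell
# Build a toy pair of gzipped FASTQ files (2 records each) just to illustrate
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' | gzip > toy_1.fastq.gz
printf '@r1\nTGCA\n+\nIIII\n@r2\nAAAA\n+\nIIII\n' | gzip > toy_2.fastq.gz

# Each FASTQ record is 4 lines, so number of reads = lines / 4
r1=$(( $(zcat toy_1.fastq.gz | wc -l) / 4 ))
r2=$(( $(zcat toy_2.fastq.gz | wc -l) / 4 ))
echo "Reads in _1: ${r1}, reads in _2: ${r2}"
```

If the two counts differ, something went wrong with the download and you should fetch the files again.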

Reference Data

We will access the reference data from the National Center for Biotechnology Information (NCBI), check out the links below by Ctrl or Cmd clicking the link to open in a new tab:

Escherichia coli str. K-12 substr. MG1655

There is a lot of information on this page, but the main pieces of information we are interested in are; the genome in FASTA format, and the gene annotations in GFF format. Can you see where these are? 👀

We will now download the data. As we are working with the command line, we have already copied the links to the data below for you 🙂. Using the 'wget':mag: command we can download files directly from the web to our local directory. The files are 'gzipped', meaning they are compressed to save space; the format also includes a checksum, which lets us make sure the data has not been corrupted during the transfer. We will also need to unzip them with the program 'gunzip':mag:.

:squirrel: This time you may copy and paste, one-by-one, the commands below:

# Create a directory to store our data
mkdir reference_sequences && cd reference_sequences

# Download the E. coli reference genome in FASTA and GFF formats
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.gff.gz

# Make an ecoli directory, and move the files there, and then unzip them
mkdir ecoli && mv *.gz ecoli && gunzip ecoli/*.gz

# Remove write permissions, so that we can't edit the files by accident
chmod 444 ecoli/*.fna
chmod 444 ecoli/*.gff
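As an aside, the gzip format stores a CRC-32 checksum, so you can verify that a download is intact without unpacking it by using 'gzip -t'. Here it is demonstrated on a throwaway file; in practice you would run it on the downloaded '.gz' files before gunzipping:

```shell
# Make a small gzipped file and verify its checksum without extracting it
printf 'example data\n' | gzip > example.txt.gz
gzip -t example.txt.gz && echo "example.txt.gz: intact"
```

A corrupted file would cause 'gzip -t' to fail with a non-zero exit status instead of printing the message.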

Databases

We will need to get the Pfam-A database of Hidden Markov Models (HMMs) and an Active Site database from the Pfam website. They are located in an FTP directory. Use the commands below, making sure you are in the "genomics_adventure" directory first.

# create a directory and a sub-directory and move there
mkdir -p db/pfam && cd db/pfam

# Download the HMMs and .dat files needed for Pfam-A
wget http://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
wget http://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.dat.gz
wget http://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/active_site.dat.gz

# Uncompress the files
gunzip *.gz
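Once unzipped, you can sanity-check the database by counting its profiles: each model in Pfam-A.hmm begins with a 'HMMER3' header line. The toy file below stands in for the real database just to show the shape of the command (for the real file you would run the grep on Pfam-A.hmm instead):

```shell
# A tiny stand-in for Pfam-A.hmm containing two profiles
printf 'HMMER3/f [3.1b2]\nNAME  toy1\n//\nHMMER3/f [3.1b2]\nNAME  toy2\n//\n' > toy.hmm

# Count the number of profiles (for the real database: grep -c '^HMMER3' Pfam-A.hmm)
grep -c '^HMMER3' toy.hmm
```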

When the downloads are finished you may continue on to the adventure by clicking the title below.