We will be using a piece of software called 'conda' 🔍 , which allows the easy installation of lots of common bioinformatics software in a special 'environment' (think of it like a box 📦 ) without the need for admin access or other complications.
Since you are attending the Workshop on Genomics, the software and all data are already installed for you!
If you are using SSH to log in to the AMI, then you will need to use the "-Y" option with SSH.
ssh -Y ip_address
This enables trusted X11 forwarding. Don't worry, this just means that we can run graphical user interfaces from the AMI on your local computer.
If you have logged in to the AMI via Guacamole and selected the Desktop interface, then don't worry about that bit... Onwards.
Open a terminal, or use the terminal in your SSH session, and you should find a directory called "workshop_materials" in your home directory, where this adventure has been cloned. All further commands will be run within the "genomics_adventure" directory.
cd workshop_materials
# enter genomics_adventure
cd genomics_adventure
This 'environment' 📦 , in which we will be having our adventure, has all the software pre-installed for easy access. So we need to activate our environment to get access to the programs.
# Activate our environment
conda activate genomics_adventure
Make sure that the environment is manually activated every time you come back to this adventure. You should see '(genomics_adventure)' before your normal terminal prompt. If it is not activated, use the 'activate' command from above.
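If the prompt is hard to read, you can also check the active environment from the shell. This is a minimal sketch that relies on the 'CONDA_DEFAULT_ENV' variable, which conda sets to the name of the active environment. The 'check_env' helper is a hypothetical name made up for this example.

```shell
# Report whether the expected conda environment is active.
# CONDA_DEFAULT_ENV is set by 'conda activate' to the active environment's name.
# 'check_env' is a hypothetical helper, not part of conda.
check_env() {
    expected="${1:-genomics_adventure}"
    if [ "${CONDA_DEFAULT_ENV:-}" = "$expected" ]; then
        echo "active"
    else
        echo "inactive"
    fi
}

# Prints "active" only when genomics_adventure is the current environment
check_env
```

If it prints "inactive", run the 'conda activate' command again.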
We will be using several sets of data for our adventure; this is similar to how you may collate data for your own analyses.
- Sequence Data
  - Either directly from a Sequencing Service or from a public access database.
- Reference Data
  - If you are lucky enough to have a reference genome...
- Databases
  - PFam-A
We will be working with the bacterial species Escherichia coli as it has a relatively small genome (which keeps the timings of this tutorial manageable), but the techniques you will learn here can be applied to any genome, smaller or larger, including Eukaryotic ones!
Back at your home institute you will likely retrieve your data from either the institute's sequencing service or a private outside provider. However, there is also a wealth 💰 of sequenced genomic data stored in publicly accessible repositories such as NCBI's SRA or EMBL-EBI's ENA. These portals are also where you will be required to deposit your sequencing data when you publish.
The raw data that we will use for the E. coli genome is available from NCBI or EMBL-EBI with the accession ERR2789854 (it is also archived at the DDBJ-DRA). The data is identical and mirrored between the sites; however, each site has a different way of accessing it. We will focus on NCBI and EMBL-EBI for now.
With NCBI you need to use a tool called 'fastq-dump' 🔍 , which, given an accession and several other options, will download the 'fastq' data files. It can be notoriously tricky at times and has some issues with downloading PacBio data. You can give it a try below if you wish; however, the EMBL-EBI downloads will be much faster for this tutorial, so we strongly suggest you start there.
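As a rough sketch of what a 'fastq-dump' call can look like for this accession: the '--split-files' option writes each read of a pair to its own file, and '--gzip' compresses the output. The 'fetch_sra' wrapper and the 'DRY_RUN' variable are hypothetical names added here so you can print the command before running it. The options used in the workshop itself may differ.

```shell
# Download the reads for one run accession with fastq-dump (SRA Toolkit).
# --split-files: write each read of a pair to a separate file
# --gzip:        compress the output files
# 'fetch_sra' and 'DRY_RUN' are hypothetical names used only in this sketch.
fetch_sra() {
    accession="$1"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Only print the command that would be run
        echo "fastq-dump --split-files --gzip $accession"
    else
        fastq-dump --split-files --gzip "$accession"
    fi
}

# Print the command without downloading anything
DRY_RUN=1
fetch_sra ERR2789854
```

Set 'DRY_RUN=0' to actually start the download, which can take a while.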
At EMBL-EBI they provide direct links to the 'fastq' files that were submitted to the archive ("Submitted files (FTP)"), and so you can use tools such as 'wget' or 'curl' to retrieve them.
Don't worry though, it's all been downloaded for you! :)
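For when you need to fetch reads yourself, here is a minimal sketch using 'wget'. It assumes you have copied the file links from the EMBL-EBI page for your accession. No real link is shown here, and the 'fetch_reads' and 'check_gzip' helpers are hypothetical names for this example. The second helper uses 'gzip -t' to test that a compressed file was not truncated during download.

```shell
# Download every URL given as an argument.
# -c lets wget resume a partially downloaded file instead of starting again.
fetch_reads() {
    for url in "$@"; do
        wget -c "$url" || return 1
    done
}

# Test the integrity of a gzip-compressed file without unpacking it.
check_gzip() {
    if gzip -t "$1" 2>/dev/null; then
        echo "ok"
    else
        echo "corrupt"
    fi
}

# Usage (replace the placeholders with links copied from the archive page):
# fetch_reads "<link to read 1 fastq.gz>" "<link to read 2 fastq.gz>"
# check_gzip my_reads_1.fastq.gz
```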
We will access the reference data from the National Center for Biotechnology Information (NCBI). Check out the link below by Ctrl or Cmd clicking it to open it in a new tab:
Escherichia coli str. K-12 substr. MG1655
There is a lot of information on this page, but the main pieces of information we are interested in are: the genome in FASTA format, and the gene annotations in GFF format. Can you see where these are? 👀
Again, we've done this bit for you! :)
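To get a feel for the two formats, here is a small sketch for taking a quick look at them. In a FASTA file every sequence record starts with a '>' header line, and in a GFF file the feature type (for example 'gene') is the third tab-separated column. The helper names and the file names in the usage comments are hypothetical.

```shell
# Count sequence records in a FASTA file (one '>' header line per record).
count_fasta_records() {
    grep -c '^>' "$1"
}

# Count 'gene' features in a GFF file (feature type is column 3).
count_gff_genes() {
    awk -F'\t' '$3 == "gene" { n++ } END { print n + 0 }' "$1"
}

# Usage, with hypothetical file names:
# count_fasta_records reference.fasta
# count_gff_genes reference.gff
```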
We will be using the PFam-A database of Hidden Markov Models (HMMs) and an Active Site database, both from the PFAM pages on the InterPro website.
Now that you have finished this section, you may continue on to the adventure by clicking the title below.