Merge pull request #3 from dmlond/master

jmbvitor committed May 13, 2015
2 parents b247cb3 + f9206e3 commit 0e83994
Showing 14 changed files with 417 additions and 7 deletions.
3 changes: 2 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -5,4 +5,5 @@ packer_cache/
*sai*
*fasta*
*fastq*
*sam*
.*sam*
*samtools*
111 changes: 111 additions & 0 deletions bin/pipeline.docker.sh
@@ -0,0 +1,111 @@
#!/bin/bash

#make sure this script runs from the ROOT of the project
cd `dirname $0`/..

# this is a simple pipeline that maps FASTQ reads to a reference genome (in FASTA format).

# Here we define where our data reside. Perhaps we may need to modify this depending on
# how we run the pipeline.
DATA=data

# Here we define the number of cores we will use for the calculations. Perhaps we may need
# to modify this depending on the configuration of our VM
CORES=2

# The location of the reference genome in relation to the data folder
REFERENCE=$DATA/Pf3D7_v2.1.5.fasta

# The location of the reads in relation to the data folder
READS_1=$DATA/ERR022523_1.fastq.gz
READS_2=$DATA/ERR022523_2.fastq.gz
FASTQS="$READS_1 $READS_2"

# recreate BWA index if not exists
if [ ! -e $REFERENCE.bwt ]; then
echo "going to index $REFERENCE"

# Warning: "-a bwtsw" does not work for short genomes,
# while "-a is" and "-a div" do not work for long
# genomes. Please choose "-a" according to the length
# of the genome.
docker-compose run bwa index -a bwtsw $REFERENCE
else
echo "$REFERENCE already indexed"
fi

# lists of produced files. These will be assigned values as we run the pipeline
SAIS=""
SAM=""

# iterate over FASTQ files
for FASTQ in $FASTQS; do

# create new names from the stem of the FASTA and FASTQ files
LOCALFASTA=`echo $REFERENCE | sed -e 's/.*\///'`
LOCALFASTQ=`echo $FASTQ | sed -e 's/.*\///'`
OUTFILE=$DATA/$LOCALFASTQ-$LOCALFASTA.sai

# grow the list of *.sai files
SAIS="$SAIS $OUTFILE"

# create a name for the SAM file
SAM=`echo $OUTFILE | sed -e "s/_.*/-$LOCALFASTA.sam/"`

# note: we don't do basic QC here, because that might mean
# that the mate pairs in the FASTQ files go out of order,
# which will result in the bwa sampe step taking an inordinate
# amount of time

# do bwa aln if needed
if [ ! -e $OUTFILE ]; then
echo "going to align $FASTQ against $REFERENCE"

# use $CORES threads
docker-compose run bwa aln -t $CORES $REFERENCE $FASTQ -f $OUTFILE
else
echo "alignment $OUTFILE already created"
fi
done

# do bwa sampe if needed
if [ ! -e $SAM ]; then

# create paired-end SAM file
echo "going to run bwa sampe $REFERENCE $SAIS $FASTQS -f $SAM"
docker-compose run bwa sampe $REFERENCE $SAIS $FASTQS -f $SAM
else
echo "sam file $SAM already created"
fi

# do samtools filter if needed
if [ ! -e $SAM.filtered ]; then
# -bS = input is SAM, output is BAM
# -F 4 = remove unmapped reads
# -q 50 = remove reads with mapping qual < 50
echo "going to run samtools view -bS -F 4 -q 50 -o $SAM.filtered $SAM"
docker-compose run samtools view -bS -F 4 -q 50 -o $SAM.filtered $SAM
docker-compose run gzip -9 $SAM
else
echo "sam file $SAM.filtered already created"
fi

# do samtools sorting if needed
if [ ! -e $SAM.sorted.bam ]; then

# sorting is needed for indexing
echo "going to run samtools sort $SAM.filtered $SAM.sorted"
docker-compose run samtools sort $SAM.filtered $SAM.sorted
else
echo "sorted file $SAM.sorted.bam already created"
fi

# create index for BAM file if needed
if [ ! -e $SAM.sorted.bam.bai ]; then

# this should result in faster processing
echo "going to run samtools index $SAM.sorted.bam"
docker-compose run samtools index $SAM.sorted.bam
else
echo "BAM file index $SAM.sorted.bam.bai already created"
fi
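
The sed-based name derivation in the loop above is easy to misread, so here is a standalone sketch of what it produces for the files this pipeline uses. `basename` is equivalent to the `sed -e 's/.*\///'` idiom in the script; the key point is that stripping everything from the first `_` onward makes both mates of the read pair resolve to the same SAM file name.

```shell
# Standalone sketch of the pipeline's filename derivation (same inputs as above).
DATA=data
REFERENCE=$DATA/Pf3D7_v2.1.5.fasta
FASTQ=$DATA/ERR022523_1.fastq.gz

LOCALFASTA=$(basename "$REFERENCE")   # Pf3D7_v2.1.5.fasta
LOCALFASTQ=$(basename "$FASTQ")       # ERR022523_1.fastq.gz
OUTFILE=$DATA/$LOCALFASTQ-$LOCALFASTA.sai

# everything from the first "_" onward is replaced, so ERR022523_1 and
# ERR022523_2 both map to the same SAM name
SAM=$(echo "$OUTFILE" | sed -e "s/_.*/-$LOCALFASTA.sam/")

echo "$OUTFILE"   # data/ERR022523_1.fastq.gz-Pf3D7_v2.1.5.fasta.sai
echo "$SAM"       # data/ERR022523-Pf3D7_v2.1.5.fasta.sam
```
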
19 changes: 19 additions & 0 deletions conf/docker/bwa/Dockerfile
@@ -0,0 +1,19 @@
FROM ubuntu:trusty
MAINTAINER Darin London <[email protected]>

RUN apt-get update \
&& apt-get install -y wget \
&& apt-get install -y bzip2 \
&& apt-get install -y tar \
&& apt-get install -y build-essential \
&& apt-get install -y zlib1g-dev
ADD install_bwa.sh install_bwa.sh
# this downloads the bwa source, makes it, moves it into place, then removes
# the downloads in one transaction to make sure downloads do not remain
# in the image
RUN ./install_bwa.sh

# this creates a default command that gets
# run when the container is run without arguments
# it will print the usage + version of bwa and exit
CMD bwa
12 changes: 12 additions & 0 deletions conf/docker/bwa/install_bwa.sh
@@ -0,0 +1,12 @@
#!/bin/bash

# download and extract bwa source
wget -O bwa-0.7.12.tar.bz2 http://sourceforge.net/projects/bio-bwa/files/bwa-0.7.12.tar.bz2/download
tar jxf bwa-0.7.12.tar.bz2
# build bwa and move it into /usr/local/bin
cd bwa-0.7.12
make
mv bwa /usr/local/bin
# clean up to minimize the size of the resulting image
cd ..
rm -rf bwa-0.7.12*
13 changes: 13 additions & 0 deletions conf/docker/raw/Dockerfile
@@ -0,0 +1,13 @@
FROM centos:latest
RUN ["/usr/sbin/useradd", "bwa_user"]
RUN ["/usr/bin/yum", "install", "-y", "wget"]
RUN ["mkdir", "-p", "/home/bwa_user/data"]
RUN ["chown","bwa_user","/home/bwa_user/data"]
RUN ["chgrp","bwa_user","/home/bwa_user/data"]
RUN ["chmod","777","/home/bwa_user/data"]
ADD download_plasmodium_raw.sh /usr/local/bin/download_plasmodium_raw.sh
VOLUME ["/home/bwa_user/data"]
WORKDIR /home/bwa_user/data
USER bwa_user
CMD ["/usr/local/bin/download_plasmodium_raw.sh"]
4 changes: 4 additions & 0 deletions conf/docker/raw/download_plasmodium_raw.sh
@@ -0,0 +1,4 @@
#!/bin/bash

wget -O /home/bwa_user/data/ERR022523_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR022/ERR022523/ERR022523_1.fastq.gz
wget -O /home/bwa_user/data/ERR022523_2.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR022/ERR022523/ERR022523_2.fastq.gz
21 changes: 21 additions & 0 deletions conf/docker/samtools/Dockerfile
@@ -0,0 +1,21 @@
FROM ubuntu:trusty
MAINTAINER Darin London <[email protected]>

RUN apt-get update \
&& apt-get install -y wget \
&& apt-get install -y bzip2 \
&& apt-get install -y gzip \
&& apt-get install -y tar \
&& apt-get install -y build-essential \
&& apt-get install -y zlib1g-dev \
&& apt-get install -y ncurses-dev
ADD install_samtools.sh install_samtools.sh
# this downloads the samtools source, makes it, moves it into place, then
# removes the downloads in one transaction to make sure downloads do not
# remain in the image
RUN ./install_samtools.sh

# this creates a default command that gets
# run when the container is run without arguments
# it will print the usage + version of samtools and exit
CMD samtools
10 changes: 10 additions & 0 deletions conf/docker/samtools/install_samtools.sh
@@ -0,0 +1,10 @@
#!/bin/bash
wget -O samtools-1.2.tar.bz2 http://sourceforge.net/projects/samtools/files/samtools/1.2/samtools-1.2.tar.bz2/download
tar jxf samtools-1.2.tar.bz2
# build samtools and move it into /usr/local/bin
cd samtools-1.2
make
mv samtools /usr/local/bin
# clean up to minimize the size of the resulting image
cd ..
rm -rf samtools-1.2*
19 changes: 19 additions & 0 deletions docker-compose.yml
@@ -0,0 +1,19 @@
bwa:
build: conf/docker/bwa
volumes:
- ./:/wdir
working_dir: /wdir
entrypoint: bwa
command: ''
samtools:
build: conf/docker/samtools
volumes:
- ./:/wdir
working_dir: /wdir
entrypoint: samtools
gzip:
build: conf/docker/samtools
volumes:
- ./:/wdir
working_dir: /wdir
entrypoint: gzip
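
With this compose file, each tool can be invoked as if it were a local command, which is exactly how the pipeline script uses it. A guarded sketch (a no-op on hosts without docker-compose or the built images):

```shell
# Exercise the compose services directly; skips cleanly when docker-compose
# or the compose file is not available on this host.
if command -v docker-compose >/dev/null 2>&1 && [ -f docker-compose.yml ]; then
  docker-compose run bwa                 # entrypoint 'bwa' prints usage and exits
  docker-compose run samtools --version
  COMPOSE_AVAILABLE=yes
else
  echo "docker-compose not available; skipping"
  COMPOSE_AVAILABLE=no
fi
```
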
File renamed without changes.
Binary file added docs/2015-05-12/mindmap_day2-t12.xmind
17 changes: 11 additions & 6 deletions docs/2015-05-13/README.md
@@ -44,14 +44,19 @@ Schedule
 The outline for today is as follows:

 - _Session 1_: Recap from yesterday: How Vagrant and Puppet can automate the creation and
-  configuration of compute environments and how to run analyses inside a VM. Brief
+  configuration of compute environments and how to run analyses inside a VM. To capture our
+  understanding of yesterday's progress, we will each make a mindmap with XMind. Give it a
+  name that includes your computer (t1, t2, etc.), add the file to your git repository and
+  send us a pull request. This way we have all of them together. Then we will have a brief
   aside on how to organize data, e.g. as produced by different runs of a pipeline or
   different steps in a larger analysis.
-- _Session 2_: If all has gone well, we will be able to access the folder
-  `arangs2015/data` on the host by navigating to `/vagrant_data/` on the VM. Verify that
-  this is the case and that you can read from it (e.g. by accessing the README.md using
-  `more`) as well as write to it (e.g. `touch foo` should create an empty file `foo`).
-  Modify the pipeline shell script to point it to the right folder and run it.
+- _Session 2_: We are going to make our own vagrant box file to share with others. The end
+  result will be something [like this](https://atlas.hashicorp.com/Naturalis/boxes/arangs2015),
+  which you can install with `vagrant init Naturalis/arangs2015` (etc.). A box file
+  is [a combination of the virtual hard drive of the VM and metadata](http://docs.vagrantup.com/v2/boxes/format.html).
+  This bundling is made using the [packer program](https://packer.io/), which you should install.
+  The bundler requires some extra scripts and config files, which we will adapt from
+  [here](https://github.com/hashicorp/atlas-packer-vagrant-tutorial).
 - _Session 3_: Docker introduction. We will now begin to look at a newer technology that has
   emerged within the last few years, Docker. In this session, we will go over the basic
   concepts of the Docker system and get to know its similarities and differences with
   virtualization. We will then learn about the docker ecosystem on registry.hub.docker.com.
   We will then install the
119 changes: 119 additions & 0 deletions docs/2015-05-13/intro_docker/README.md
@@ -0,0 +1,119 @@
![GTPB](http://gtpb.igc.gulbenkian.pt/bicourses/images/GTPB2015logo.png "GTPB")

Introducing Docker
==================

[Docker](https://www.docker.com) has some similarities with virtualization technologies:

- both involve the creation of reusable images
- both involve running one or more instances of an image on a host machine
- images can be transported from one host to another and run successfully
  so long as the hosting software is installed

Docker images differ from virtualization images in several important ways:

- they are 5-10 times smaller
- they depend on, and share, much more of the host's Linux resources
- they are less secure
- instances are called containers
- containers can be instantiated and run within seconds
- containers can be plugged into the host's tty, STDIN, STDOUT, and STDERR

The primary difference between a Docker image and a VM image is tied to
a philosophical difference.

VM images are created to host an entire machine architecture which is run as if it were its own machine, completely oblivious to its host.

Docker images are designed to host a single application and its dependencies, and to run on the host as if natively installed. To compose a pipeline, you use or create a docker image for each required application, and run its containers hooked into the host to varying degrees, much as you would run a natively installed application.

Docker Ecosystem
----------------

**Docker Machine**

Host systems must install and run the Docker daemon. The daemon can only run on a modern Linux kernel (one released within the last two years or so). Almost all flavors of Linux (Fedora, Red Hat, Ubuntu, Debian) can therefore host the daemon natively. Some Unix-like systems (Mac OS X in particular) do not use the Linux kernel, and must run the docker daemon inside a VM built on a Linux flavor with a modern kernel. This introduces a bit more complexity, but it also introduces the powerful concept of using external docker hosts 'in the cloud'.

The docker daemon runs a web service in the background and listens on special ports for requests to manage docker images and containers. It provides a REST API that can be used by any client. Typically the connection is encrypted with TLS, a standard used by many client-server network systems. TLS requires that each client generate an encrypted certificate (not the same certificate as used with GitHub) for its communication with the service. The primary client that uses this REST interface is the docker command-line interface.
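
To make the REST interface concrete, here is a hypothetical probe of the daemon using plain `curl`. It assumes a daemon listening on the default local Unix socket (the TLS/TCP transport that docker-machine configures works the same way, just over a network port), and degrades to a message on machines without one:

```shell
# Probe the Docker daemon's REST interface directly, without the docker CLI.
SOCK=/var/run/docker.sock
if [ -S "$SOCK" ] && command -v curl >/dev/null 2>&1; then
  # ask the daemon for its version via HTTP over the Unix socket
  curl -s --unix-socket "$SOCK" http://localhost/version
else
  echo "no local docker socket; daemon not probed"
fi
PROBE_DONE=yes
```
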

The [docker-machine](https://docs.docker.com/machine) command automates the process of getting a docker host running on any computer with a supported virtualization system (VirtualBox and VMware are supported). It makes it much easier to get Docker up and running if you do not have systems-administration expertise. It does this by:

- downloading a special VM image, preconfigured to host and run the docker daemon, for the specified VM management system
- generating TLS certificates
- starting and stopping the VM
- providing an easy way to configure the environment needed by the docker command-line interface (see below)

The docker-machine command can also be used to create docker machines on many cloud [hosting systems](https://docs.docker.com/machine/#using-docker-machine-with-a-cloud-provider), which may be attractive to those wanting more powerful compute environments than their own machine or institution provides.

**Docker**

The [docker commandline interface](https://docs.docker.com/reference/commandline/cli/) is written in the Go programming language. There are versions available for every known operating system (even Windows 10!). It is designed to interface with the Docker Machine daemon over the network using its REST interface. By compartmentalizing the docker interface from the docker machine, it is possible to use the same docker command to interface with a docker machine running anywhere on the network.

The client must run in the context of a special set of environment variables:
* DOCKER_TLS_VERIFY (1 if using TLS, default)
* DOCKER_CERT_PATH (path to TLS certificate if using TLS)
* DOCKER_HOST (url and port to the Docker Host daemon service)
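
In practice `eval "$(docker-machine env default)"` sets these for you; the sketch below sets them by hand just to show their shape. The IP address, port, and certificate path are made-up illustrative values, not ones your machine will actually use:

```shell
# Roughly what `eval "$(docker-machine env default)"` leaves in the shell.
# All three values below are hypothetical placeholders.
export DOCKER_TLS_VERIFY=1
export DOCKER_CERT_PATH="$HOME/.docker/machine/machines/default"
export DOCKER_HOST="tcp://192.168.99.100:2376"

# the docker client reads these and contacts the daemon at $DOCKER_HOST
echo "client will contact: $DOCKER_HOST"
```
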

The docker commandline interface provides the full set of tools needed to create and manage docker images and image container instances.

* pull images from a Docker Registry (it knows about the Official Docker Registry by default)
* push images to a Docker Registry (requires login)
* list images
* build images from a build context (more about this tomorrow)
* remove images
* tag images (acts like an alias)
* run container instances of images
* list containers
* start and stop existing container instances (background only)
* pause/unpause existing containers (foreground and background)
* kill a running container (stop is preferred but kill can be used to stop a runaway container process)
* rm stopped/killed container instances
* inspect container instances (running or stopped)
* dump the log (STDOUT) from a running container
* save and load a tar file of an image (can be used instead of a registry to move docker images from one machine to another)
* exec a command in a running container (allows you to interact with, and change the state of a running container)
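
A typical container lifecycle strings several of these verbs together. The sketch below assumes a reachable Docker daemon and uses the small `tutum/hello-world` image from the Resources list; it is guarded so it degrades to a message without Docker:

```shell
# pull -> run -> list -> logs -> stop -> rm, guarded for hosts without Docker.
if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
  docker pull tutum/hello-world
  docker run -d --name hello tutum/hello-world   # daemon mode, named container
  docker ps                                      # list running containers
  docker logs hello                              # dump the container's STDOUT
  docker stop hello
  docker rm hello                                # stopped containers persist until removed
else
  echo "docker daemon not reachable; skipping lifecycle demo"
fi
LIFECYCLE_SKETCH=done
```
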

There are many arguments that you can provide to the [run](https://docs.docker.com/reference/run/) command:
* container naming: docker gives every container a default name (sometimes humorous), but you can name a container explicitly at run time
* interactivity mode (interactive or daemon mode)
* attach the host tty (we will demonstrate this) to an interactive container
* mount local directories to the container file system
* connect one container to another container to make a private network between them
* mount volumes from other, special containers, called volume containers, to the container file system
* set the user, group, working directory to be used inside the container
* set environment variables
* override the default entrypoint or command (more on this tomorrow)
* connect host and container STDIN, STDOUT, and STDERR
* expose container ports to the host
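
An illustrative `docker run` invocation combining several of the flags above. The image, paths, and environment variable are hypothetical; the command is built as a string and printed rather than executed, so no Docker daemon is required:

```shell
# Dry-run illustration: naming, interactivity, a volume mount, working
# directory, an environment variable, and a published port in one command.
RUN_CMD='docker run --name demo -it \
  -v /home/user/data:/data -w /data \
  -e CORES=2 -p 8080:80 \
  ubuntu:trusty bash'
echo "$RUN_CMD"
```
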

**Docker Registry**

Docker hosts a worldwide [Registry](https://registry.hub.docker.com/) of Docker images. Anyone with docker can share their own images with the world. Images shared on the Docker Registry are publicly visible. It is possible to [host your own registry](http://docs.docker.com/registry/deploying/).

The Docker commandline tool is preconfigured to know about and use the official
Docker Registry.

- `docker pull <image>` will pull the image down onto your host
- `docker run <image>` will pull the image down if it is not already present, and then run a container of it

Lesson Plan
-----------

- install docker-machine and docker
- explore the Docker Registry
- run some docker images
- with and without docker pull
- with and without local storage
- with exposed ports
- connected to other container systems/services
- inspect information about containers
- inspect the log from running containers
- remove images
- remove containers (with volumes)

Resources
---------
- https://www.docker.com/
- https://docs.docker.com/machine/
- https://docs.docker.com/compose/
- https://docs.docker.com/userguide/
- https://docs.docker.com/reference/commandline/cli/
- https://registry.hub.docker.com
- https://registry.hub.docker.com/u/tutum/hello-world/