hemberg-lab · jianye0383 · Oct 26, 2017 · Oct 26, 2017 · Oct 27, 2017 · Oct 27, 2017
diff --git a/.Renviron b/.Renviron
@@ -0,0 +1 @@
+R_MAX_NUM_DLLS = 250
diff --git a/.dockerignore b/.dockerignore
@@ -0,0 +1,2 @@
+# do not copy git directory
+    .git
diff --git a/.gitignore b/.gitignore
@@ -2,3 +2,15 @@
 .Rhistory
 .RData
 .*.Rnb.cached
+.DS_*
+*/.DS_*
+*.Rproj
+.Rbuildignore
+pars*.rds
+deng.csv
+scimpute_count.txt
+MAGIC_count.csv
+totalCounts_by_cell.rds
+clust.rds
+tung/reads.rds
+tung/umi.rds
diff --git a/01-intro.Rmd b/01-intro.Rmd
diff --git a/02-intro.Rmd b/02-intro.Rmd
@@ -0,0 +1,69 @@
+---
+output: html_document
+---
+
+# Introduction to single-cell RNA-seq
+
+```{r, echo=FALSE}
+library(knitr)
+opts_chunk$set(fig.align = "center", echo=FALSE)
+```
+
+## Bulk RNA-seq
+
+* A major breakthrough in the late 2000s (it replaced microarrays) that has been widely used since
+* Measures the __average expression level__ for each gene across a large population of input cells
+* Useful for comparative transcriptomics, e.g. samples of the same tissue from different species
+* Useful for quantifying expression signatures from ensembles, e.g. in disease studies
+* __Insufficient__ for studying heterogeneous systems, e.g. early development studies, complex tissues (brain)
+* Does __not__ provide insights into the stochastic nature of gene expression
+
+## scRNA-seq
+
+* A __new__ technology, first publication by [@Tang2009-bu]
+* Did not gain widespread popularity until [~2014](https://www.ohio.edu/bioinformatics/upload/Single-Cell-RNA-seq-Method-of-the-Year-2013.pdf) when new protocols and lower sequencing costs made it more accessible
+* Measures the __distribution of expression levels__ for each gene across a population of cells
+* Makes it possible to study new biological questions in which __cell-specific changes in the transcriptome are important__, e.g. cell type identification, heterogeneity of cell responses, stochasticity of gene expression, inference of gene regulatory networks across cells.
+* Datasets range __from $10^2$ to $10^6$ cells__ and increase in size every year
+* Currently there are several different protocols in use, e.g. SMART-seq2 [@Picelli2013-sb], CEL-seq [@Hashimshony2012-kd] and Drop-seq [@Macosko2015-ix]
+* There are also commercial platforms available, including the [Fluidigm C1](https://www.fluidigm.com/products/c1-system), [Wafergen ICELL8](https://www.wafergen.com/products/icell8-single-cell-system) and the [10X Genomics Chromium](https://www.10xgenomics.com/single-cell/)
+* Several computational analysis methods from bulk RNA-seq __can__ be used
+* __In most cases__ computational analysis requires adaptation of the existing methods or development of new ones
+
+## Workflow
+
+```{r intro-rna-seq-workflow, out.width = '90%', fig.cap="Single cell sequencing (taken from Wikipedia)"}
+knitr::include_graphics("figures/RNA-Seq_workflow-5.pdf.jpg")
+```
+
+Overall, experimental scRNA-seq protocols are similar to the methods used for bulk RNA-seq. We will be discussing some of the most common approaches in the next chapter.
+
+## Computational Analysis
+
+This course is concerned with the computational analysis of the data
+obtained from scRNA-seq experiments. The first steps (yellow) are general for any high-throughput sequencing data. Later steps (orange) require a mix of existing RNA-seq analysis methods and novel methods to address the technical differences of scRNA-seq. Finally, the biological interpretation (blue) __should__ be analyzed with methods specifically developed for scRNA-seq.
+
+```{r intro-flowchart, out.width = '65%', fig.cap="Flowchart of the scRNA-seq analysis"}
+knitr::include_graphics("figures/flowchart.png")
+```
+
+Several reviews of scRNA-seq analysis are available, including [@Stegle2015-uv].
+
+Today, there are also several different platforms available for carrying out one or more steps in the flowchart above. These include:
+
+* [Falco](https://github.com/VCCRI/Falco/), a single-cell RNA-seq processing framework on the cloud.
+* [SCONE](https://github.com/YosefLab/scone) (Single-Cell Overview of Normalized Expression), a package for single-cell RNA-seq data quality control and normalization.
+* [Seurat](http://satijalab.org/seurat/) is an R package designed for QC, analysis, and exploration of single cell RNA-seq data.
+* [ASAP](https://asap.epfl.ch/) (Automated Single-cell Analysis Pipeline) is an interactive web-based platform for single-cell analysis.
+* [Bioconductor](https://master.bioconductor.org/packages/release/workflows/html/simpleSingleCell.html) is an open-source, open-development software project for the analysis of high-throughput genomics data, including packages for the analysis of single-cell data.
+
+
+## Challenges
+
+The main difference between bulk and single cell RNA-seq is that each sequencing library represents a single cell, instead of a population of cells. Therefore, significant attention has to be paid to comparison of the results from different cells (sequencing libraries). The main sources of discrepancy between the libraries are:
+
+* __Amplification__ (up to 1 million fold)
+* __Gene 'dropouts'__ in which a gene is observed at a moderate expression level in one cell but is not detected in another cell [@Kharchenko2014-ts].
+
+In both cases the discrepancies are introduced due to low starting amounts of transcripts since the RNA comes from one cell only. Improving the transcript capture efficiency and reducing the amplification bias are currently active areas of research. However, as we shall see in this course, it is possible to alleviate some of these issues through proper normalization and corrections.
+
diff --git a/02-literature.Rmd b/02-literature.Rmd
diff --git a/03-exp-methods.Rmd b/03-exp-methods.Rmd
@@ -0,0 +1,72 @@
+---
+output: html_document
+---
+
+```{r, echo=FALSE}
+library(knitr)
+opts_chunk$set(fig.align = "center", echo=FALSE, out.width = '70%')
+```
+
+## Experimental methods
+
+```{r, fig.cap="Moore's law in single cell transcriptomics (image taken from [Svensson et al](https://arxiv.org/abs/1704.01379))", out.width = '100%'}
+knitr::include_graphics("figures/moores-law.png")
+```
+
+Development of new methods and protocols for scRNA-seq is currently a very active area of research, and several protocols have been published over the last few years. A non-comprehensive list includes:
+
+* CEL-seq [@Hashimshony2012-kd]
+* CEL-seq2 [@Hashimshony2016-lx]
+* Drop-seq [@Macosko2015-ix]
+* InDrop-seq [@Klein2015-kz]
+* MARS-seq [@Jaitin2014-ko]
+* SCRB-seq [@Soumillon2014-eu]
+* Seq-well [@Gierahn2017-es]
+* Smart-seq [@Picelli2014-ic]
+* Smart-seq2 [@Picelli2014-ic]
+* [SMARTer](http://www.clontech.com/US/Products/cDNA_Synthesis_and_Library_Construction/Next_Gen_Sequencing_Kits/Total_RNA-Seq/Universal_RNA_Seq_Random_Primed)
+* STRT-seq [@Islam2014-cn]
+
+The methods can be categorized in different ways, but the two most important aspects are __quantification__ and __capture__. 
+
+For quantification, there are two types, __full-length__ and __tag-based__. The former tries to achieve a uniform read coverage of each transcript. By contrast, tag-based protocols only capture either the 5'- or 3'-end of each RNA. The choice of quantification method has important implications for what types of analyses the data can be used for. In theory, full-length protocols should provide an even coverage of transcripts, but as we shall see, there are often biases in the coverage. The main advantage of tag-based protocols is that they can be combined with unique molecular identifiers (UMIs), which can help improve the quantification (see chapter \@ref(umichapter)). On the other hand, being restricted to one end of the transcript may reduce the mappability and it also makes it harder to distinguish different isoforms [@Archer2016-zq].
+
+The strategy used for capture determines the throughput, how the cells can be selected, and what kind of additional information besides the sequencing can be obtained. The three most widely used options are __microwell-__, __microfluidic-__ and __droplet-__ based.
+
+```{r, fig.cap="Image of microwell plates (image taken from Wikipedia)"}
+knitr::include_graphics("figures/300px-Microplates.jpg")
+```
+
+For well-based platforms, cells are isolated using, for example, a pipette or laser capture and placed in microfluidic wells. One advantage of well-based methods is that they can be combined with fluorescence-activated cell sorting (FACS), making it possible to select cells based on surface markers. This strategy is thus very useful for situations when one wants to isolate a specific subset of cells for sequencing. Another advantage is that one can take pictures of the cells. The image provides an additional modality and a particularly useful application is to identify wells containing damaged cells or doublets. The main drawback of these methods is that they are often low-throughput and the amount of work required per cell may be considerable.
+
+```{r, fig.cap="Image of a 96-well Fluidigm C1 chip (image taken from Fluidigm)"}
+knitr::include_graphics("figures/fluidigmC1.jpg")
+```
+
+Microfluidic platforms, such as [Fluidigm's C1](https://www.fluidigm.com/products/c1-system#workflow), provide a more integrated system for capturing cells and for carrying out the reactions necessary for the library preparations. Thus, they provide a higher throughput than microwell-based platforms. Typically, only around 10% of cells are captured in a microfluidic platform, so they are not appropriate if one is dealing with rare cell types or very small amounts of input. Moreover, the chip is relatively expensive, but since reactions can be carried out in a smaller volume, money can be saved on reagents.
+
+```{r, out.width = '60%', fig.cap="Schematic overview of the drop-seq method (Image taken from Macosko et al)"}
+knitr::include_graphics("figures/drop-seq.png")
+```
+
+The idea behind droplet-based methods is to encapsulate each individual cell inside a nanoliter droplet together with a bead. The bead is loaded with the enzymes required to construct the library. In particular, each bead contains a unique barcode which is attached to all of the reads originating from that cell. Thus, all of the droplets can be pooled, sequenced together and the reads can subsequently be assigned to the cell of origin based on the barcodes. Droplet platforms typically have the highest throughput since the library preparation costs are on the order of $0.05$ USD/cell. Instead, sequencing costs often become the limiting factor, and in a typical experiment the coverage is low, with only a few thousand different transcripts detected [@Ziegenhain2017-cu].
+
+## What platform to use for my experiment?
+
+The most suitable platform depends on the biological question at hand. For example, if one is interested in characterizing the composition of a tissue, then a droplet-based method, which allows a very large number of cells to be captured, is likely to be the most appropriate. On the other hand, if one is interested in characterizing a rare cell population for which there is a known surface marker, then it is probably best to enrich using FACS and then sequence a smaller number of cells.
+
+Clearly, full-length transcript quantification will be more appropriate if one is interested in studying different isoforms since tagged protocols are much more limited. By contrast, UMIs can only be used with tagged protocols and they can facilitate gene-level quantification.
+
+Two recent studies from the Enard group [@Ziegenhain2017-cu] and the Teichmann group [@Svensson2017-op] have compared several different protocols. In their study, Ziegenhain et al compared five different protocols on the same sample of mouse embryonic stem cells (mESCs). By controlling for the number of cells as well as the sequencing depth, the authors were able to directly compare the sensitivity, noise levels and costs of the different protocols. One example of their conclusions is illustrated in the figure below, which shows the number of genes detected (for a given detection threshold) for the different methods. As you can see, there is almost a two-fold difference between Drop-seq and Smart-seq2, suggesting that the choice of protocol can have a major impact on the study.
+
+```{r, out.width = '60%', fig.cap="Enard group study"}
+knitr::include_graphics("figures/ziegenhainEnardFig1.png")
+```
+
+Svensson et al take a different approach by using synthetic transcripts (spike-ins, more about these later) with known concentrations to measure the accuracy and sensitivity of different protocols. Comparing a wide range of studies, they also reported substantial differences between the protocols.
+
+```{r, out.width = '100%', fig.cap="Teichmann group study"}
+knitr::include_graphics("figures/svenssonTeichmannFig2.png")
+```
+
+As protocols are developed and computational methods for quantifying the technical noise are improved, it is likely that future studies will help us gain further insights regarding the strengths of the different methods. These comparative studies are useful not only for helping researchers decide which protocol to use, but also for developing new methods, as the benchmarking makes it possible to determine which strategies are the most useful.
diff --git a/03-method.Rmd b/03-method.Rmd
diff --git a/04-L1-process-raw-QC.Rmd b/04-L1-process-raw-QC.Rmd
@@ -0,0 +1,92 @@
+---
+output: html_document
+code_folding: hide
+---
+
+```{r include=FALSE}
+library('bookdown')
+```
+
+# Processing Raw scRNA-seq Data
+
+## FastQC
+
+Once you've obtained your single-cell RNA-seq data, the first thing you need to do with it is check the quality of the reads you have sequenced. For this task, today we will be using a tool called FastQC. FastQC is a quality control tool for sequencing data, which can be used for both bulk and single-cell RNA-seq data. FastQC takes sequencing data as input and returns a report on read quality. Copy and paste this link into your browser to visit the FastQC website:
+
+https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
+
+This website contains links to download and install FastQC and documentation on the reports produced. Fortunately we have already installed FastQC for you today, so instead we will take a look at the documentation. Scroll down the webpage to 'Example Reports' and click 'Good Illumina Data'. This gives an example of what an ideal report should look like for high-quality Illumina read data.
+
+Now let's make a FastQC report ourselves.
+
+Today we will be performing our analysis using a single cell from an mESC dataset produced by [@Kolodziejczyk2015-xy]. The cells were sequenced using the SMART-seq2 library preparation protocol and the reads are paired-end. The files are located in `Share`.
+
+__Note__ The current text of the course is written for an AWS server for people who attend our course in person. To run the commands yourself, you will have to download the files (both `ERR522959_1.fastq` and `ERR522959_2.fastq`) and create a `Share` directory. You can find the files here: https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-2600/samples/
+
+Now let's look at the files:
+```{bash, eval=FALSE}
+less Share/ERR522959_1.fastq
+less Share/ERR522959_2.fastq
+```
+
+Task 1: Try to work out what command you should use to produce the FastQC report. Hint: Try executing
+
+```{bash, eval=FALSE, collapse=TRUE}
+fastqc -h
+```
+
+This command will tell you what options are available to pass to FastQC. Feel free to ask for help if you get stuck! If you are successful, you should generate a .zip and a .html file for both the forward and the reverse read files. Once you have been successful, feel free to have a go at the next section.
+
+
+### Solution and Downloading the Report
+
+If you haven't done so already, generate the FastQC report using the commands below:
+
+```{bash, eval=FALSE, echo = TRUE}
+mkdir fastqc_results
+fastqc -o fastqc_results Share/ERR522959_1.fastq Share/ERR522959_2.fastq
+```
+
+Once the command has finished executing, you should have a total of four files: one zip file and one html file for each of the two paired-end read files. The report is in the html file. To view it, we will need to get it off AWS and onto your computer using either FileZilla or scp. Ask an instructor if you are having difficulties.
+
+Once the file is on your computer, click on it. Your FastQC report should open. Have a look through the file. Remember to look at both the forward and the reverse read reports! How good is the quality of the reads? Is there anything we should be concerned about? How might we address those concerns?
+
+Feel free to chat to one of the instructors about your ideas.
+
+## Trimming Reads
+
+Fortunately there is software available for read trimming. Today we will be using Trim Galore!. Trim Galore! is a wrapper for the read-trimming software cutadapt.
+
+Read trimming software can be used to trim sequencing adapters and/or low-quality bases from the ends of reads. Given that we noticed some adapter contamination in our FastQC report, it is a good idea to trim adapters from our data.
+
+Task 2: What type of adapters were used in our data? Hint: Look at the FastQC report 'Adapter Content' plot.
+
+Now let's try to use Trim Galore! to remove those problematic adapters. It's a good idea to check read quality again after trimming, so after you have trimmed your reads you should use FastQC to produce another report.
+
+Task 3: Work out the command you should use to trim the adapters from our data. Hint 1: You can use 
+
+```{bash, eval=FALSE}
+trim_galore -h
+```
+
+to find out which options Trim Galore! accepts.
+Hint 2: Read through the output of the above command carefully. The adapter used in this experiment is quite common. Do you need to know the actual sequence of the adapter to remove it?
+
+Task 4: Produce a FastQC report for your trimmed reads files. Is the adapter contamination gone?
+
+Once you think you have successfully trimmed your reads and have confirmed this by checking the FastQC report, feel free to check your results using the next section.
+
+### Solution
+
+You can use the command(s) below to trim the Nextera sequencing adapters:
+
+```{bash, eval=FALSE}
+mkdir fastqc_trimmed_results
+trim_galore --nextera --paired -o fastqc_trimmed_results Share/ERR522959_1.fastq Share/ERR522959_2.fastq
+```
+
+Remember to generate new FastQC reports for your trimmed reads files! FastQC should now show that your reads pass the 'Adapter Content' check. Feel free to ask one of the instructors if you have any questions.
+
+Congratulations! You have now generated read quality reports and performed adapter trimming. In the next lab, we will use STAR and Kallisto to align our trimmed and quality-checked reads to a reference transcriptome.
+
+
diff --git a/04-application.Rmd b/04-application.Rmd