intro.Rmd

---
title: "Technology Overview"
author: "Mark Dunning; mark 'dot' dunning 'at' cruk.cam.ac.uk, Oscar Rueda; oscar 'dot' rueda 'at' cruk.cam.ac.uk"
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
output: html_document

---


#Historical overview


Two-colour microarrays were used to compare two samples (e.g. cancer
and normal cells) on the same microarray. The RNA from the two samples
is extracted separately and fluorescently labelled with different dyes, usually
red and green. Therefore, after hybridisation, each feature is a mixture of
red and green fluorescence. A completely red or green feature indicates that
a particular gene is expressed in one sample, but not the other. 


![probe-and-target](https://upload.wikimedia.org/wikipedia/en/c/c8/Microarray-schema.jpg)


Single-channel microrrays can also be produced to measure the absolute
expression level of every gene of interest in a given sample. Therefore, the
fluorescence of each feature is a measure of the expression level of a particular gene. Arguably the most popular single-channel microarray
technology was that of Affymetrix. As we will describe in the next section, these arrays use 25 base-pair probes that are synthesised on the array surface. Each gene of interest is interrogated by a collection of 11-20 probe pairs, known as a probe set. The expression level for a gene is then derived by combining all measurements from a particular probe set.

Illumina (who are probably best-known for their sequencing technologies these days), were also a major player. [Until recently](http://core-genomics.blogspot.co.uk/2014/08/seqc-kills-microarrays-not-quite.html), all gene-expression arrays at the CRUK Cambridge Institute were run on Illumina arrays and we have much experience in conducting [large-scale cancer studies](http://www.nature.com/nature/journal/v486/n7403/full/nature10983.html) and developing [software](http://bioinformatics.oxfordjournals.org/content/23/16/2183.full) and [analysis methods](http://nar.oxfordjournals.org/content/38/3/e17.long) for these arrays.


## Typical workflow

Despite differences in array construction, there are a few commonalities in the way that raw data from a microarray experiment are processed. 

### Image Processing

A microarray surface is typically scanned by a laser to produce an image
representation of the fluorescence emitted by it. Thus, depending on the
resolution of the scanner, each feature will be represented by a number of
pixels. These are known as the raw images and are usually in the 16-bit
TIFF image format. Therefore, the intensity of each pixel is a value in the
range 0 to 2^16 − 1. 


```{r echo=FALSE,message=FALSE,fig.align='left',fig.cap="A high-resolution TIFF image is the result of scanning the array surface"}
library(affy)
targetsFile <- "estrogen/estrogen.txt"
pd <- read.AnnotatedDataFrame(targetsFile,header=TRUE,sep="",row.names=1)

raw <-ReadAffy(celfile.path = "estrogen", filenames=rownames(pData(pd))[1],phenoData = pd[1,])
image(raw,main="")
```

These images are usually processed by the manufacturers' software, which involves locating all the features on the image and then
calculating foreground intensities using the pixels that make up each feature. However, the pixel intensities measured on the image may be influenced by
factors other than hybridisation, such as optical noise from the scanner or
foreign items deposited on the array. Therefore, a background intensity is
estimated for each feature to account for such factors. The background and
foreground estimates generally act as a starting point for statistical analysis.

![close-up](images/imageprocessing.png)

### Data processing
The intensities of the features on a microarray are influenced by many sources of noise and repeated measurements made on different microarrays may also appear to disagree. Therefore, a number of data-cleaning, or pre-processing steps, must take place before being able to draw valid biological conclusions from a microarray experiment ([Quackenbush, 2002](http://www.nature.com/ng/journal/v32/n4s/full/ng1032.html); [Smyth et al., 2003](http://www.statsci.org/smyth/pubs/Smyth-MethodsMolecularBiology-2003.pdf); [Allison
et al., 2006](http://www.nature.com/nrg/journal/v7/n1/full/nrg1749.html))

- Background correction
    + A separate step to the correction for chip surface anomalies discussed above
    + Microarray probes are affected by cross hybridisation and other noise sources. The baseline measurements from the instrument are never zero
    + We address this by measuring 'negative control' probes that we don't expect to yield any signal
    + Affymetrix and Illumina have different ways of doing this
- Quality assessment
    + Some chips might be dodgy or we may have used poor-quality samples
- Transformation
    + The TIFF images yield values on the scale 0 to 2^16, which is not very convenient for analysis
    + A suitable transformation, such as log$_2$ is often used so a change of 1 unit in this corresponds to a two-fold change.
- Normalisation
    + Systematic effects may emerge over time which we need to calibrate
    + Experiments should be adequately-designed to cope with this
- Annotation
    + Microarry manufacturers use their own identifier schemes that don't relate to biology
        + "ILMN_1343291", "ILMN_1343295",...
        + "1000_at", "1001_at",....
    + We need to map these IDs to gene names, genome position etc.
    + Sometimes they mappings can be wrong, like for [Affy](http://nar.oxfordjournals.org/content/33/20/e175.abstract) and [Illumina](http://nar.oxfordjournals.org/content/38/3/e17.long)
    + Actually, this is a main reason why sequencing is better
        
        
##Microarrays vs sequencing

- Probe design issues
- Limited number of novel findings
- Genome coverage
- On the other-hand, microarray analysis methods are well-understood and established pipelines can process the data quickly and efficiently
- (although sequencing (particularly RNA-seq) is catching-up)

##Are arrays still relevant?

- Wealth of data available online e.g. on [G.E.O](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL10558)
- Useful as a validation platform
- Methods are established and well-understood
- Cheaper? And easier access to equipment


The "death" of microarrays was predicted as early as [2008](http://www.nature.com/news/2008/081015/full/455847a.html). In reality, it took quite a lot longer for arrays to be come obsolete. We have recently reached the [tipping point](http://core-genomics.blogspot.co.uk/2014/08/seqc-kills-microarrays-not-quite.html) where RNA-seq has taken over from gene expression arrays.

There is a vast amount of [Illumina](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL10558) and [Affymetrix](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL570) data out there waiting to be explored. Some studies often use these historical samples as validation of computational methods of cancer subtypes. e.g.  [here](http://jnci.oxfordjournals.org/content/107/1/dju357.abstract?cited-by=yes;107/1/dju357) or [here](http://www.genomebiology.com/2014/15/8/431)


### Many of the same issues and techniques apply to NGS data

- Experimental Design; despite this fancy new technolgy, if we don't design the experiments properly we won't get meaningful conclusions
- Quality assessment; Yes, NGS experiments can still go wrong!
- Normalisation; NGS data come with their own set of biases and error that need to be accounted for
- Stats; testing for RNA-seq is built-upon the knowledge from microarrays
    + some of the analysis we do for proteomics uses the linear modelling approach to be described later

Microarray data are much more manageable in size. We can work with decent-sized experiments (~100s of samples) and learn about high-dimensional analysis techniques that you will encounter in the analysis of newer, sexier, technologies.

## The Bioconductor project

![BioC](images/logo_bioconductor.png)

-  Packages analyse all kinds of Genomic data (>800)
- Compulsory documentation (*vignettes*) for each package
- 6-month release cycle
- [Course Materials](http://bioconductor.org/help/course-materials/)
- [Example data](http://bioconductor.org/packages/release/BiocViews.html#___ExperimentData) and [workflows](http://bioconductor.org/help/workflows/)
- Common, re-usable framework and functionality
- [Available Support](https://support.bioconductor.org/)
    + Often you will be able to interact with the package maintainers / developers and other power-users of the project software
- Annual conferences in U.S and Europe
    - The 2015 European conference was in [Cambridge](https://sites.google.com/site/eurobioc2015/)
    
Many of the packages are by well-respected authors and get lots of citations.

![citations](images/citations.png)

##Downloading a package

Each package has its own landing page. e.g. http://bioconductor.org/packages/release/bioc/html/beadarray.html. Here you'll find;

- Installation script (will install all dependancies)
- Vignettes and manuals
- Details of package maintainer
- After downloading, you can load using the `library` function. e.g. `library(beadarray)`

##Reading data using Bioconductor

Recall that data can be read into R using `read.csv`, `read.delim`, `read.table` etc. Several packages provided special modifications of these to read raw data from different manufacturers

- `limma` for various two-colour platforms
- `affy` for Affymetrix data
- `beadarray`, `lumi`, `limma` for Illumina BeadArray data
- A common class is used to represent the data


A dataset may be split into different components

- Matrix of expression values
- Sample information
- Annotation for the probes

In Bioconductor we will often put these data the same object for easy referencing. The `Biobase` package has all the code to do this.