a1_child.Rmd

---
title: "Assignment 1: Data Set Selection and Initial Processing"
author: "Priyanka Narasimhan"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, cache = TRUE, fig.asp = 1)
```
#### Data Set Selection and Initial Processing
Downloaded data set GSE136864 from GEO NCBI site, and obtained supplementary file: GSE136864_elf1_counts_matrix.txt.gz
```{r install_packages, include=FALSE}
rm(list = ls())
if (! requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
if (! requireNamespace("Biobase", quietly = TRUE)) {
  BiocManager::install("Biobase")
}
if (! requireNamespace("GEOquery", quietly = TRUE)) {
  BiocManager::install("GEOquery")
}
# Load
library(BiocGenerics)
library(Biobase)
library(GEOquery)
```


```{r download_dataset, include=FALSE}
gset <- GEOquery::getGEO("GSE136864", GSEMatrix =TRUE)
# Some additional info:
length(gset) # the length is 1
edata <- gset[[1]]
edata # This is our expression matrix
# Our phenodata returns a dataframe that shows our 17 GEO samples, and 49 variable (column) labels
names = (pData(edata))
names
```

```{r, include=FALSE}
supfile <- NULL
sfilename="GSE136864_elf1_counts_matrix.txt.gz" 
# Don't download if supp. file already exists
if (!file.exists(sfilename)) {
    supfile <- getGEOSuppFiles('GSE136864')
 } 
fnames = rownames(supfile)
fnames
```
 
#### 1 - General Information about Platform:

```{r, include=FALSE}
gse <- getGEO("GSE136864", GSEMatrix=FALSE)
current_gpl <- names(GPLList(gse))[1]
current_gpl_info <- Meta(getGEO(current_gpl))
```

**Platform title:** `r current_gpl_info$title` <br />
**Submission date:** `r current_gpl_info$submission_date` <br />
**Last updated date:** `r current_gpl_info$last_update_date` <br />
**Organism:** `r current_gpl_info$organism` <br />
**Number of GEO datasets that use this technology:** `r  length(current_gpl_info$series_id)` <br />
**Number of GEO samples that use this technology:** `r  length(current_gpl_info$sample_id)` <br />

#### 2 - Experimental Design:
In this experiment, the growth of viral infections and host responses were studied in the presence of the ELF1 gene using A549 cells (found in human lung tissue), under different conditions. <br />
A549 cells (which are found in lung tissue), were used in the experiment. These cells were either: <br />
- not-transduced <br />
- or transduced with empty vector, ELF1-WT, or ELF1-R8A mutant. <br />
In addition, indicated cultures were stimulated with interferon beta for 6 hours or 48 hours. All samples were harvested simultaneously and analyzed by RNA-Seq.

Total Samples Collected: 17 <br />
Before filtering the data set: 63678 rows <br />
```{r, include=FALSE}
sfiles = getGEOSuppFiles('GSE136864')
counts = read.delim(rownames(sfiles)[1],header=TRUE, check.names = FALSE)
head(counts)
```

```{r}
dim(counts) # returns [1] 63678    18
```

```{r, include=FALSE}
# Use column 1 in the 'counts' table as the rownames
rownames(counts) <- counts[, 1]
counts
# check for any duplicate rownames, there are no duplicate row names!
any(duplicated(rownames(counts)))
```

Define groups:
```{r}
# Define groups, using 1, 3, and 2 token after split string
# "Treatment" column - tells us which was transduced in this test, it could be either empty, ELF1, cell or R8A
# "trial_num" - replicate number
# Test_run is just a concatenation of the 2 columns Treatment and mock_or_IFN to get the name of a sample test run
samples <- data.frame(lapply(colnames(counts)[2:18],
                             FUN = function(x) {
                               unlist(strsplit(x, split="_"))[c(1, 3, 2)]
                             }))
colnames(samples) <- colnames(counts)[2:18]
rownames(samples) <- c("Treatment", "trial_num", "mock_or_IFN")
samples <- data.frame(t(samples))
samples$Test_run = paste(samples$Treatment,"_",samples$mock_or_IFN)
samples
```

#### 3 - Map to HUGO symbols
Our dataset curently contains 63,678 Ensembl (ENSG) genes which need to be mapped to HUGO symbols. Here are the first 10 rows after HUGO symbol mapping. 
```{r, include=FALSE}
# install and load the necessary libraries
if (! requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
    BiocManager::install("org.Hs.eg.db")
}
suppressMessages(library("AnnotationDbi"))
suppressMessages(library("org.Hs.eg.db"))
# Map to HUGO symbols and add as a new column 'gname' to our table
if (!any(colnames(counts) == 'gname')) {
  counts <- cbind(counts, gname=mapIds(org.Hs.eg.db,keys=rownames(counts),column="SYMBOL",keytype="ENSEMBL",multiVals="first"))
}
# Now check for duplicates in newly added gname column
any(duplicated(counts$gname)) # Yes there are duplicates!
counts
```

#### 4 - Clean the data
We want to do some filtering and cleaning up the data before moving onto normalization. <br />
We have over 63,000 genes to account for, but genes with low counts should be removed prior to downstream analysis, because from a statistical point of view, low counts are not feasible to make reliable judgements. <br />
Based on edgeR recommendations, we shall keep our minumum count as 2, since we have a test sample type with only 2 replicates. <br />

After Cleaning: 14935 rows <br />
```{r}

library(edgeR)
cpms = cpm(counts[,2:18])
rownames(cpms) <- counts[,1]
keep = rowSums(cpms >1) >=2
counts_filtered <- counts[keep,]
dim(counts_filtered) # returns [1] 14935    19
```

```{r, include=FALSE}
# Number of genes that are 'NA', that don't map to anything
no_gene_counts <- sum(is.na(counts_filtered$gname))
no_gene_counts # returns [1] 1327
# Even if a gene is labelled as 'NA' we still keep it, because it has nothing to do with the validity of the gene count. It just means that a gene symbol needs to be requested.
# Before filtering, there were lots of duplicates, but after removing the ones with low counts we look at the duplicates and we barely have any 
summarized_gene_counts <- sort(table(counts_filtered$gname),decreasing = TRUE)
summarized_gene_counts <- which(summarized_gene_counts>1)
summarized_gene_counts
# returns 
# LMF1   RAD54B TRAPPC13 
#   1        2        3
```


#### 5 - Normalize the data

```{r, include=FALSE}
data2plot <- log2(cpm(counts_filtered[,2:18]))
boxplot(data2plot, xlab = "Samples", ylab = "log2 CPM", 
        las = 2, cex = 0.5, cex.lab = 0.5,
        cex.axis = 0.5, main = "A549 cells RNASeq Samples, before normalization")
#draw the median on each box plot
abline(h = median(apply(data2plot, 2, median)), col = "red", lwd = 0.9, lty = "dashed")
```

There are outliers in each of our 17 samples. By looking at the boxplot, there are a lot of values far above and below the median. These samples may require more investigation.  When I take a look at the sample table data, I can also see that there are a wide range of values for gene count. 

According to the journal, _An Iterative Leave-One-Out Approach to Outlier Detection in RNA-Seq Data_ by , "RNA-seq measures RNA content through digital expression profiling by counting the number of sequencing reads that map to a particular feature (e.g. exon, gene, or transcript). Given the dynamic range of RNA-seq data and practically no ceiling for quantification, extreme high counts (i.e. outliers) for a given feature are often present in one or more RNA samples within an experimental group."

Because of this, I have chosen to leave the outliers as is. 

Let's normalize the data by TTM. I am choosing TTM because, my analysis is between the different samples, and it is said to be one of the methods that perform well both with the ability to detect and control false positives. 
And since my data includes a lot of outliers in both extremes, TTM will allow me to remove the upper and lower percentages.

```{r}
library(edgeR)
filtered_counts_matrix <- as.matrix(counts_filtered[,2:18])
rownames(filtered_counts_matrix) <- counts_filtered$ensembl75_id
d = DGEList(counts=filtered_counts_matrix, group=samples$Transduced)
d = calcNormFactors(d)
normalized_counts <- cpm(d)
```

After normalization..
```{r}
norm_data <- log2(cpm(normalized_counts[,1:17]))
boxplot(norm_data, xlab = "Samples", ylab = "log2 CPM", 
        las = 2, cex = 0.5, cex.lab = 0.5,
        cex.axis = 0.5, main = "A549 cells RNASeq Samples, after normalization")
#draw the median on each box plot
abline(h = median(apply(data2plot, 2, median)), col = "red", lwd = 0.9, lty = "dashed")
```

```{r, include=FALSE}
counts_filtered == normalized_counts # should return FALSE
dim(counts) - dim(counts_filtered)
# we were able to filter out 48743 rows of data from 63678 rows
```

Here are the first 10 rows of our new filtered, normalized data (this table is scrollable until the end):
```{r}
counts_filtered_structured <- counts_filtered
rownames(counts_filtered_structured) <- c()
colnames(counts_filtered_structured)[1] <- "ensembl_gene"
counts_filtered_structured <- counts_filtered_structured[, c(1, 19, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18)]
head(counts_filtered_structured, 10)
```