-
Notifications
You must be signed in to change notification settings - Fork 6
/
R_organism_script.Rmd
236 lines (192 loc) · 10.7 KB
/
R_organism_script.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
---
title: "R_organism_script.Rmd"
author: Sam Westreich, [email protected], github.com/transcript
output: html_document
---
This is a script for processing output files from MG-RAST, converting organism output data into a graph. This program creates a barplot of activity within a sample grouped by organism, both by relative and by absolute activity.
Necessary packages to include:
```{r packages, results='hide'}
library(DESeq2)
library(ggplot2)
library(gridExtra)
library(scales)
library(reshape)
library(knitr)
```
Before starting this script, please adjust the following parameters. The working directory should contain both the experimental sample files (designated by "exp_<filename>") and the control sample files (designated by "control_<filename>").
Note: all output files should have been simplified using the output_reducer Python script to remove headers and other extraneous information.
```{r working_directory}
setwd("~/Desktop/Projects/Lab Stuff/SAMSA_pipeline_v1/public_output_files/public_files_ready_for_R/organism/")
```
Now, we read in the files from the working directory as specified above. The file should be of a format such as:
"control_RefSeq_organism_identifier-name.tab.output-simplified".
Note that if the identifier name or ID of the file is not specified as such, the script will not select the correct section of the file name to use as an ID.
```{r file_listing}
control_files <- list.files(
pattern = "control_*", full.names = T, recursive = FALSE)
control_names = ""
for (name in control_files) {
control_names <- c(control_names, unlist(strsplit(name, split='_', fixed=TRUE))[2])} #change the "2" to a different part of the name if not following standard naming conventions as above
control_names <- control_names[-1]
control_names_trimmed = ""
for (name in control_names) {
control_names_trimmed <- c(control_names_trimmed, unlist(strsplit(name, split='.', fixed=TRUE))[1])}
control_names_trimmed <- control_names_trimmed[-1]
rm (control_names)
exp_files <- list.files(
pattern = "experiment_*", full.names = T, recursive = FALSE)
exp_names = ""
for (name in exp_files) {
exp_names <- c(exp_names, unlist(strsplit(name, split='_', fixed=TRUE))[2])} #again, change 2 if non-standard file names are used
exp_names <- exp_names[-1]
exp_names_trimmed = ""
for (name in exp_names) {
exp_names_trimmed <- c(exp_names_trimmed, unlist(strsplit(name, split='.', fixed=TRUE))[1])}
exp_names_trimmed <- exp_names_trimmed[-1]
rm (exp_names, name)
```
Next, we will load these files in as two tables, one for the experimental data and one for the control data.
```{r open_tables}
# loading the control table
y <- 0
for (x in control_files) {
y <- y + 1
if (y == 1) {
control_table <- read.table(file = x, header = F, quote = "", sep = "\t")
colnames(control_table) = c("DELETE", x, "V3")
control_table <- control_table[,c(2,3)] }
if (y > 1) {
temp_table <- read.table(file = x, header = F, quote = "", sep = "\t")
colnames(temp_table) = c("DELETE", x, "V3") }
if (y > 1) {
control_table <- merge(control_table, temp_table[,c(2,3)], by = "V3", all.x = T) }
}
control_table[is.na(control_table)] <- 0
rownames(control_table) = control_table$V3
control_table_trimmed <- control_table[,-2, drop = FALSE]
# loading the experimental table
y <- 0
for (x in exp_files) {
y <- y + 1
if (y == 1) {
exp_table <- read.table(file = x, header=F, quote = "", sep = "\t")
colnames(exp_table) = c("DELETE", x, "V3")
exp_table <- exp_table[,c(2,3)] }
if (y > 1) {
temp_table <- read.table(file = x, header = F, quote = "", sep = "\t")
colnames(temp_table) = c("DELETE", x, "V3") }
if (y > 1) {
exp_table <- merge(exp_table, temp_table[,c(2,3)], by = "V3", all.x = T) }
}
exp_table[is.na(exp_table)] <- 0
rownames(exp_table) = exp_table$V3
exp_table_trimmed <- exp_table[,-2, drop = FALSE]
```
Now, we simplify the names of the columns of these two tables, using the ID names scrubbed from the filenames above in chunk 3, file_listing. If this step is failing, please note the proper naming of the files as mentioned above chunk 3.
```{r column_names}
colnames(control_table_trimmed) = control_names_trimmed
colnames(exp_table_trimmed) = exp_names_trimmed
```
******** NOTE ********
If we only want the DESeq analysis, skip to line 198, where we perform the differential analysis. The following step instead focuses on creating a visual graph.
******** NOTE ********
We next need to use cbind to restructure these tables so that they can be handled by ggplot.
```{r melting_tables}
control_table_trimmed_m <- melt(cbind(control_table_trimmed,
Genus = rownames(control_table_trimmed)), id.vars = c('Genus'))
exp_table_trimmed_m <- melt(cbind(exp_table_trimmed,
Genus = rownames(exp_table_trimmed)), id.vars = c('Genus'))
```
This script displays an output of the activity of the top 30 organisms, instead of displaying all 1,000+ organisms within the total sample. For that reason, the rest of the organism activity is clumped together into an "other" catchall category.
```{r other_catchall}
control_table_filtered <- control_table_trimmed
control_table_filtered[,"Total"] <- rowSums(control_table_filtered)
control_table_filtered <- control_table_filtered[ with (control_table_filtered, order(-Total)), ]
exp_table_filtered <- exp_table_trimmed
exp_table_filtered[,"Total"] <- rowSums(exp_table_filtered)
exp_table_filtered <- exp_table_filtered[ with (exp_table_filtered, order(-Total)), ]
control_table_filtered["Other",] <- colSums(control_table_filtered[30:nrow(control_table_filtered),])
control_table_filtered <- rbind(control_table_filtered[1:29,], control_table_filtered[nrow(control_table_filtered),])
exp_table_filtered["Other",] <- colSums(exp_table_filtered[30:nrow(exp_table_filtered),])
exp_table_filtered <- rbind(exp_table_filtered[1:29,], exp_table_filtered[nrow(exp_table_filtered),])
control_table_filtered <- control_table_filtered[,-ncol(control_table_filtered), drop = FALSE]
exp_table_filtered <- exp_table_filtered[,-ncol(exp_table_filtered), drop = FALSE]
control_table_filtered_m <- melt(cbind(control_table_filtered,
Genus = rownames(control_table_filtered)), id.vars = c('Genus'))
exp_table_filtered_m <- melt(cbind(exp_table_filtered,
Genus = rownames(exp_table_filtered)), id.vars = c('Genus'))
```
Now, we merge the two tables to create one full table that will be displayed on graphs. Note that the inserted row creates a space on the graph between the control and experimental graphs.
```{r merging_tables}
temprow <- matrix(c(rep.int(0,length(control_table_trimmed_m))),nrow=1,ncol=length(control_table_trimmed_m))
newrow <- data.frame(temprow)
colnames(newrow) <- colnames(control_table_trimmed_m)
newrow$variable <- "<blank>"
newrow$Genus <- "Filler"
full_table <- rbind(control_table_trimmed_m, newrow, exp_table_trimmed_m)
full_table_top30 <- rbind(control_table_filtered_m, newrow, exp_table_filtered_m)
```
Because normal ggplot colors tend to blend together too much when required to select 30 colors, a custom palette is specified here:
```{r palette}
CbPalette <- c("#a6cee3", "#1f78b4", "#b2df8a", "#33a02c", "#fb9a99", "#e31a1c", "#fdbf6f", "#ff7f00", "#cab2d6", "#6a3d9a", "#ffff99", "#b15928", "#8dd3c7", "#ffffb3", "#bebada", "#fb8072", "#80b1d3", "#fdb462", "#b3de69", "#fccde5", "#d9d9d9", "#bc80bd", "#ccebc5", "#ffed6f", "#e41a1c", "#377eb8", "#4daf4a", "#984ea3", "#ff7f00", "#ffff33", "#a65628", "#f781bf", "#999999", "#000000", "#a6cee4", "#1f78b5", "#b2df8b")
```
Now, we create the graphs! First, we create a stacked bar by percentage, allowing us to see relative activity composition by sample.
```{r percentage_plot}
org_relative_ggplot <- ggplot(full_table_top30, aes(x = variable, y = value, fill = Genus)) +
geom_bar(position = "fill", stat = "identity") +
scale_fill_manual(values = CbPalette) +
scale_y_continuous(labels = percent_format()) +
theme(legend.position = "bottom") +
guides(fill = guide_legend(ncol=10)) +
ggtitle("Top 30 gut microorganisms by relative abundance") +
xlab("Sample ID") + ylab("Relative activity of total sample")
```
Next, we create a stacked bar graph by raw counts, showing how total sample counts compare to each other.
```{r absolute_plot}
org_absolute_ggplot <- ggplot(full_table_top30, aes(x = variable, y = value, fill = Genus)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = CbPalette) +
theme(legend.position = "bottom") +
guides(fill = guide_legend(ncol=10)) +
ggtitle("Top 30 gut microorganisms by total abundance") +
xlab("Sample ID") + ylab("Total reads per sample")
```
Finally, we display both of these graphs on a single display!
```{r graph_output}
grid.arrange(org_relative_ggplot, org_absolute_ggplot)
```
******** DESEQ ANALYSIS ********
And lastly, we can run DESeq2 to get some statistical data on most significantly different organisms, by expression level, between our experimental and control samples.
```{r DESeq2_analysis}
library(DESeq2)
complete_table <- merge(control_table_trimmed, exp_table_trimmed, by=0, all = TRUE)
complete_table[is.na(complete_table)] <- 1
rownames(complete_table) <- complete_table$Row.names
complete_table <- complete_table[,-1]
complete_table_col_names = ""
i = 0
while (i < (length(control_files))) {
i <- i + 1
complete_table_col_names <- append(complete_table_col_names, paste("control", as.character(i), sep = "")) }
i = 0
while (i < length(exp_files)) {
i <- i + 1
complete_table_col_names <- append(complete_table_col_names, paste("experimental", as.character(i), sep = "")) }
complete_table_col_names <- complete_table_col_names[-1]
completeCondition <- data.frame(condition=factor(c(rep("control", length(control_files)), rep("experimental", length(exp_files)))))
colnames(complete_table) <- complete_table_col_names
dds <- DESeqDataSetFromMatrix(complete_table, completeCondition, ~ condition)
dds <- DESeq(dds)
baseMeanPerLvl <- sapply( levels(dds$condition), function(lvl) rowMeans( counts(dds,normalized=TRUE)[,dds$condition == lvl] ) )
res <- results(dds, contrast = c("condition", "experimental", "control"))
org_results <- data.frame(res)
org_results <- merge(org_results, baseMeanPerLvl, by="row.names")
org_results <- org_results[,c(1,2,8,9,3,4,5,6,7)]
colnames(org_results)[c(3,4)] <- c("controlMean", "experimentalMean")
plotMA(dds, ylim=c(-2.5,2.5), main="DESeq2 analysis of all organisms")
plotCounts(dds, gene=which.min(res$padj), intgroup="condition")
sorted_org_results <- org_results[order(-org_results$baseMean),]
colnames(sorted_org_results)[1] <- "Organism Name"
write.table(sorted_org_results, file = "DESeq_public_data_organism_output.tab", append = FALSE, quote = FALSE, sep = "\t", row.names = FALSE, col.names = TRUE)
```
This should now have a file saved of all DESeq comparisons, labeled as DESeq_organism_output.tab, in the working directory.