forked from bioinformatics-core-shared-training/Bitesize-R
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathsession5.Rmd
814 lines (616 loc) · 27.4 KB
/
session5.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
---
title: "Week 5 -- Data manipulation with dplyr"
---
> #### Learning objectives
>
> * Chain operations together into a workflow using pipes
> * Learn the key **dplyr** functions for manipulating your data
> * Learn about faceting in ggplot2 to split your data into separate categories
> and create a series of sub-plots arranged in a grid
---
# Data manipulation with dplyr
**dplyr** is one of the packages that gets loaded as part of the tidyverse.
```{r}
library(tidyverse)
```
dplyr is the Swiss army knife in the tidyverse, providing many useful functions
for manipulating tabular data in data frames or tibbles. We're going to look at
the key functions for filtering our data, modifying the contents and computing
summary statistics.
We'll also introduce the pipe operator, **`%>%`**, for chaining operations
together into mini workflows in a way that makes for more readable and
maintainable code.
Finally, we'll return to plotting and look at a powerful feature of ggplot2,
**faceting**, that allows you to divide your plots into subplots by splitting
the observations based on one or more categorical variables.
We'll again use the METABRIC data set to illustrate how these operations work.
```{r message = FALSE}
metabric <- read_csv("data/metabric_clinical_and_expression_data.csv")
metabric
```
---
# dplyr verbs
The dplyr operations are commonly referred to as "verbs" in a data manipulation
grammar. These verbs have a common syntax and work together in a consistent and
uniform manner. They all have the following shared behaviours:
* The first argument in each function is a data frame (or tibble)
* Any additional arguments describe what operation to perform on the data frame
* Variable names, i.e. column names, are referred to without using quotes
* The result of an operation is a new data frame
---
# Chaining operations using `%>%`
Let's consider again an earlier example in which we filtered the METABRIC data
set to retain just the patients who were still alive at the time of the study
and had survived for more than 10 years (120 months). We use `filter()` to
select the rows corresponding to the patients meeting these criteria and can
then use `select()` to only display the variables (columns) we're most
interested in.
```{r}
patients <- filter(metabric,
Survival_status == "LIVING" & Survival_time > 120)
patient_details <- select(patients,
Patient_ID,
Survival_time,
Tumour_stage,
Nottingham_prognostic_index)
patient_details
```
Here we've used an intermediate variable, `patients`, which we only
needed in order to get to the final result. We could just have used the same
name to avoid cluttering our environment and overwritten the results from the
`filter()` operation with those of the `select()` operation.
```{r eval = FALSE}
patients <- select(patients,
Patient_ID,
Survival_time,
Tumour_stage,
Nottingham_prognostic_index)
```
In order to avoid creating intermediate objects, another less readable way of
writing this code is to nest the `filter()` function call inside the
`select()`. Nested function calls are very common in a lot of R code you may
come across. For example, imagine we have a vector of numbers and we want to
find the square root of the mean. We could do it two steps:
```{r}
x <- c(24, 53, 352, 643, 46)
myMean <- mean(x)
sqrt(myMean)
```
Or we can nest the commands inside each other:
```{r}
sqrt(mean(x))
```
This is convenient for simple commands, but is unwieldy and not easy to read
with more complex commands:
```{r}
patients <- select(filter(metabric, Survival_status == "LIVING", Survival_time > 120), Patient_ID, Survival_time, Tumour_stage, Nottingham_prognostic_index)
nrow(patients)
```
However, there is another way chaining together a series of operations into a
mini workflow that is elegant, intuitive and makes for very readable R code.
For that we need to introduce a new operator, the **pipe** operator, **`%>%`**.
```{block type = "rmdblock"}
**The pipe operator `%>%`**
The pipe operator takes the output from one operation, i.e. whatever is on the
left-hand side of `%>%` and passes it in as the first argument to the second
operation, or function, on the right-hand side.
**`x %>% f(y)`** is equivalent to **`f(x, y)`**
For example:
`select(starwars, name, height, mass)`
can be rewritten as
`starwars %>% select(name, height, mass)`
This allows for chaining of operations into workflows, e.g.
`starwars %>%`<br/>
`filter(species == "Droid") %>%`<br/>
`select(name, height, mass)`
The `%>%` operator comes from the `magrittr` package and is available when we
load the tidyverse using `library(tidyverse)`.
Piping in R was motivated by the Unix pipe, `|`, in which the output from one
process is redirected to be the input for the next. This is so named because the
flow from one process or operation to the next resembles a pipeline.
NOTE: The `%>%` operator is from the packages **magrittr**. R now has a native
pipe operator, `|>`, but we'll stick with the **`%>%`** operator as it is
more widely used and has a few advantages over the native pipe.
```
We can rewrite the code for our filtering and column selection operations as
follows.
```{r}
patients <- metabric %>%
filter(Survival_status == "LIVING" & Survival_time > 120) %>%
select(Patient_ID, Survival_time, Tumour_stage, Nottingham_prognostic_index)
```
Note how each operation takes the output from the previous operation as its
first argument. This way of coding is embraced wholeheartedly in the tidyverse
hence almost every tidyverse function that works on data frames has the data
frame as its first argument. It is also the reason why tidyverse functions
return a data frame regardless of whether the output could be recast as a vector
or a single value.
"Piping", the act of chaining operations together, becomes really useful when
there are several steps involved in filtering and transforming a data set.
The usual way of developing a workflow is to build it up one step at a time,
testing the output produced at each stage. Let's do that for this case.
We start by considering just the patients who are living.
```{r}
patients <- metabric %>%
filter(Survival_status == "LIVING")
```
We then add another filter for the survival time.
```{r}
patients <- metabric %>%
filter(Survival_status == "LIVING") %>%
filter(Survival_time > 120)
```
At each stage we look at the resulting `patients` data frame to check
we're happy with the result.
Finally we only want certain columns, so we add a `select()` operation.
```{r}
patients <- metabric %>%
filter(Survival_status == "LIVING") %>%
filter(Survival_time > 120) %>%
select(Patient_ID, Survival_time, Tumour_stage, Nottingham_prognostic_index)
# print out the result
patients
```
When continuing our workflow across multiple lines, we need to be careful to
ensure the `%>%` is at the end of the line. If we try to place this at the start
of the next line, R will think we've finished the workflow prematurely and will
report an error at what it considers the next statement, i.e. the line that
begins with `%>%`.
```
# R considers the following to be 2 separate commands, the first of which ends
# with the first filter operation and runs successfully.
# The second statement is the third and last line, is not a valid commmand and
# so you'll get an error message
patients <- metabric %>%
filter(Survival_status == "LIVING")
%>% filter(Survival_time > 120)
```
This is very similar to what we saw with adding layers and other components to
a ggplot using the `+` operator, where we needed the `+` to be at the end of a
line.
We'll be using the pipe `%>%` operator throughout the rest of the course so
you'd better get used to it. But actually we think you'll come to love it as
much as we do.
---
# Modifying data using `mutate()`
In one of the examples of filtering observations using `filter()` above, we
created a new logical variable called `Deceased`.
```{r}html_document"
metabric$Deceased <- metabric$Survival_status == "DECEASED"
```
This is an example of a rather common type of data manipulation in which we
create a new column based on the values contained in one or more other columns.
The `dplyr` package provides the `mutate()` function for this purpose.
```{r}
metabric <- mutate(metabric, Deceased = Survival_status == "DECEASED")
```
Both of these methods adds the new column to the end.
Note that variables names in the `mutate()` function call do not need to be
prefixed by `metabric$` just as they don't in any of the `dplyr` functions.
We can overwrite a column and this is quite commonly done when we want to change
the units. For example, our tumour size values are in millimetres but the
formula for the Nottingham prognostic index needs the tumour size to be in
centimetres. We can convert to `Tumour_size` to centimetres by dividing by 10.
```{r}
metabric <- mutate(metabric, Tumour_size = Tumour_size / 10)
```
```{block type = "rmdblock"}
**Nottingham Prognostic Index**
The Nottingham prognostic index (**NPI**) is used to determine prognosis
following surgery for breast cancer. Its value is calculated using three
pathological criteria: the size of the tumour, the number of lymph nodes
involved, and the grade of the tumour.
The formula for the Nottingham prognostic index is:
**`NPI = (0.2 * S) + N + G`**
where
* **S** is the size of the tumour in centimetres
* **N** is the node status (0 nodes = 1, 1-3 nodes = 2, >3 nodes = 3)
* **G** is the grade of tumour (Grade I = 1, Grade II = 2, Grade III = 3)
```
We'll recalculate the Nottingham prognostic index using the values
`Tumour_size`, `Neoplasm_histologic_grade` and `Lymph_node_status` columns in
our METABRIC data frame.
```{r}
metabric <- metabric %>%
mutate(NPI = (0.2 * Tumour_size) +
Lymph_node_status +
Neoplasm_histologic_grade)
select(metabric,
Tumour_size,
Lymph_node_status,
Neoplasm_histologic_grade,
NPI,
Nottingham_prognostic_index)
```
There is a discrepency. Can you see what the problem is?
It appears that the tumour size wasn't correctly converted into centimetres in
the original NPI calculation or that the wrong scaling factor for tumour size
was used.
```{r scatter_plot_1}
ggplot(data = metabric) +
geom_point(mapping = aes(x = Age_at_diagnosis, y = NPI), na.rm = TRUE)
```
The **`round()`** function is really useful for rounding numerical values to a
specified number of decimal places. We'll read in the METABRIC data again and
create a small workflow that carries out the tumour size conversion, computes
the NPI, rounds the tumour size and the resulting NPI value to 1 decimal place
and displays the results in decreasing order of NPI.
```{r message = FALSE}
read_csv("data/metabric_clinical_and_expression_data.csv") %>%
mutate(Tumour_size = Tumour_size / 10) %>%
mutate(NPI = (0.2 * Tumour_size) +
Lymph_node_status +
Neoplasm_histologic_grade) %>%
mutate(Tumour_size = round(Tumour_size, digits = 1)) %>%
mutate(NPI = round(NPI, digits = 1)) %>%
arrange(desc(NPI)) %>%
select(Tumour_size, Lymph_node_status, Neoplasm_histologic_grade, NPI)
```
## Mutating multiple columns
In that last workflow we included the same rounding operation applied to two
different variables. It would be nice to be able to carry out just the one
`mutate()` but apply it to both `Tumour_size` and `NPI` columns and we can
using **`across()`**.
`across` is used inside `mutate` and takes two arguments, first a selection of
columns to operate across and second a function to apply to those columns.
So if we just wanted to round to the the nearest integer:
```{r}
metabric %>%
mutate(across(c(Tumour_size, NPI), round)) %>%
select(Patient_ID, Tumour_size, NPI)
```
This applies the function `round` across all columns specified in the first
argument for `across`.
This is nice and simple but things can get a little more complex if we want more
than just a straightforward function.
We may come across situations where we'd like to apply the same operation to
multiple columns but where there is no available function in R to do what we
want or situations we need to pass extra arguments to our function.
In these cases the syntax get a little trickier. We need to write what is known
as an "anonymous function".
Let's say we want to convert the petal and sepal measurements in the `iris`
data set from centimetres to millimetres. We'd either need to create a new
function to do this conversion or we could use what is known as an anonymous
function.
There is no 'multiply by 10' function and it seems a bit pointless to create
one just for this conversion so we'll use an anonymous function instead --
anonymous because it has no name, it's an _in situ_ function only used in our
`mutate()` function call.
```{r}
iris %>%
as_tibble() %>%
mutate(across(Sepal.Length:Petal.Width, ~ .x * 10))
```
The **`~`** denotes that we're using an anonymous function (it is the symbol for
formulae in R) and the `.x` is a placeholder for the column being operated on. In
this case, we're multiplying each of the columns between `Sepal.Length` and
`Petal.Width` inclusive by 10.
If we wish to use an existing function but provide additional arguments, then
the syntax is:
`~func(.x, ...)`
Where `.x` is going to be replaced with the column names and `...` is the extra
arguments we need to add.
So if we want to round to 1 decimal place:
```{r}
metabric %>%
mutate(across(c(Tumour_size, NPI), ~round(.x, digits = 1))) %>%
select(Patient_ID, Tumour_size, NPI)
```
We can use the range operator and the same helper functions as we did for
selecting columns using `select()` inside `across()`.
For example, we might decide that our expression values are given to a much
higher degree of precision than is strictly necessary.
```{r}
metabric %>%
mutate(across(ESR1:MLPH, ~round(.x, digits = 2))) %>%
select(Patient_ID, ESR1:MLPH)
```
Or we could decide that all the columns whose names end with "\_status" are in
fact categorical variables and should be converted to factors.
```{r}
metabric %>%
mutate(across(ends_with("_status"), as.factor)) %>%
select(Patient_ID, ends_with("_status"))
```
---
# Computing summary values using `summarise()`
We'll cover the fifth of the main dplyr 'verb' functions, **`summarise()`**,
only briefly here. This function computes summary values for one or more
variables (columns) in a table. Here we will summarise values for the entire
table but this function is much more useful in combination with `group_by()`
in working on groups of observations within the data set. We will look at
summarizing groups of observations next week.
Any function that calculates a single scalar value from a vector can be used
with `summarise()`. For example, the `mean()` function calculates the arithmetic
mean of a numeric vector. Let's calculate the average ESR1 expression for tumour
samples in the METABRIC data set.
```{r}
mean(metabric$ESR1)
```
The equivalent operation using `summarise()` is:
```{r}
summarise(metabric, mean(ESR1))
```
If you prefer Oxford spelling, in which -ize is preferred to -ise, you’re in
luck as dplyr accommodates the alternative spelling.
Both of the above statements gave the same average expression value but these
were output in differing formats. The `mean()` function collapses a vector to
single scalar value, which as we know is in fact a vector of length 1. The
`summarise()` function, as with most tidyverse functions, returns another data
frame, albeit one in which there is a single row and a single column.
Returning a data frame might be quite useful, particularly if we’re summarising
multiple columns or using more than one function, for example computing the
average and standard deviation.
```{r}
summarise(metabric, ESR1_mean = mean(ESR1), ESR1_sd = sd(ESR1))
```
Notice how we also named the output columns in this last example.
```{block type = "rmdblock"}
**`summarise()`**
`summarise()` collapses a data frame into a single row by calculating summary
values of one or more of the columns.
It can take any function that takes a vector of values and returns a single
value. Some of the more useful functions include:
* Centre: **`mean()`**, **`median()`**
* Spread: **`sd()`**, **`mad()`**
* Range: **`min()`**, **`max()`**, **`quantile()`**
* Position: **`first()`**, **`last()`**
* Count: **`n()`**
Note the `first()`, `last()` and `n()` are only really useful when working on
groups of observations using **`group_by()`**.
**`n()`** is a special function that returns the number of observations; it
doesn't take a vector argument, i.e. a column.
```
It is also possible to summarise using a function that takes more than one
value, i.e. from multiple columns. For example, we could compute the correlation
between the expression of FOXA1 and MLPH.
```{r}
summarise(metabric, correlation = cor(FOXA1, MLPH))
```
## Summarizing multiple columns
As with `mutate()` we can use **across()** to summarise across multiple columns
using the same function.
```{r}
summarise(metabric, across(c(FOXA1, MLPH), mean))
```
We can use the `everything()` selector to summarise across all the columns in
the table.
```{r}
metabric %>%
select(ESR1:MLPH) %>%
summarise(across(everything(), mean))
```
But, you have to be careful that all columns can be summarised with the given
summary function. For example, what happens if we try to compute an average of
a set of character values?
```{r}
metabric %>%
summarise(across(everything(), ~mean(.x, na.rm = TRUE)))
```
We get a lot of warning messages and `NA` values for those columns for which
computing an average does not make sense.
We can combine `across()` with `where()` and a logical test (e.g. `is.numeric`)
to select those columns for which a summarization function is appropriate.
```{r}
metabric %>%
summarise(across(where(is.numeric), ~mean(.x, na.rm = TRUE)))
```
It is possible to summarise using more than one function in which case a list of
functions needs to be provided.
```{r}
metabric %>%
summarise(across(c(ESR1, ERBB2, PGR), list(mean, sd)))
```
Pretty neat but I'm not sure about those column headings in the output --
fortunately we have some control over these.
```{r}
metabric %>%
summarise(across(c(ESR1, ERBB2, PGR), list(Average = mean, stdev = sd)))
```
Anonymous functions can also be used with summarise.
For example, suppose we want to compute the correlation of the
expression for FOXA1 against all other genes to see which was most strongly
correlated. Here is how we could do this in a single `summarise()` statement
using an anonymous function.
```{r}
metabric %>%
summarise(across(c(ESR1:MLPH, -FOXA1), ~cor(.x, FOXA1)))
```
Notice how we selected all genes between ESR1, the first gene column in our
data frame, and MLPH, the last gene column, but then excluded FOXA1 as we're not
all that interested in the correlation of FOXA1 with itself (we know the answer
is 1).
---
# Faceting with ggplot2
Finally, let's change track completely and take a look at a very useful feature
of ggplot2 -- **faceting**.
Faceting allows you to split your plot into subplots, or facets, based on one
or more categorical variables. Each of the subplots displays a subset of the
data.
There are two faceting functions, **`facet_wrap()`** and **`facet_grid()`**.
Let's create a scatter plot of GATA3 and ESR1 expression values where we're
displaying the PR positive and PR negative patients using different colours.
This is a very similar to a plot we created last week.
```{r scatter_plot_2}
ggplot(data = metabric,
mapping = aes(x = GATA3, y = ESR1, colour = PR_status)) +
geom_point(size = 0.5, alpha = 0.5)
```
An alternative is to use faceting with **`facet_wrap()`**.
```{r scatter_plot_3}
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
geom_point(size = 0.5, alpha = 0.5) +
facet_wrap(vars(PR_status))
```
This produces two plots, side-by-side, one for each of the categories in the
`PR_status` variable, with a banner across the top of each for the category.
The variable(s) used for faceting are specified using `vars()` in a similar way
to the selection of variables for `mutate_at()` and `summarise_at()`.
We can still use separate colours if we prefer things to be, well, colourful.
```{r scatter_plot_4}
ggplot(data = metabric,
mapping = aes(x = GATA3, y = ESR1, colour = PR_status)) +
geom_point(size = 0.5, alpha = 0.5) +
facet_wrap(vars(PR_status))
```
Faceting is usually better than displaying groups using different colours when
there are more than two or three groups when it can be difficult to really tell
which points belong to each group. A case in point is for the 3-gene
classification in the GATA3 vs ESR1 scatter plot we created last week. Let's
create a faceted version of that plot.
```{r scatter_plot_5}
ggplot(data = metabric,
mapping = aes(x = GATA3, y = ESR1, colour = `3-gene_classifier`)) +
geom_point(size = 0.5, alpha = 0.5) +
facet_wrap(vars(`3-gene_classifier`))
```
This helps explain why the function is called `facet_wrap()`. When it has too
many subplots to fit across the page, it wraps around to another row. We can
control how many rows or columns to use with the `nrow` and `ncol` arguments.
```{r scatter_plot_6}
ggplot(data = metabric,
mapping = aes(x = GATA3, y = ESR1, colour = `3-gene_classifier`)) +
geom_point(size = 0.5, alpha = 0.5) +
facet_wrap(vars(`3-gene_classifier`), nrow = 1)
```
```{r scatter_plot_7}
ggplot(data = metabric,
mapping = aes(x = GATA3, y = ESR1, colour = `3-gene_classifier`)) +
geom_point(size = 0.5, alpha = 0.5) +
facet_wrap(vars(`3-gene_classifier`), ncol = 2)
```
We can combine faceting on one variable with a colour aesthetic for another
variable. For example, let's show the tumour stage status using faceting and the HER2
status using colours.
```{r scatter_plot_8}
ggplot(data = metabric,
mapping = aes(x = GATA3, y = ESR1, colour = HER2_status)) +
geom_point(size = 0.5, alpha = 0.5) +
facet_wrap(vars(Neoplasm_histologic_grade))
```
Instead of this we could facet on more than variable.
```{r scatter_plot_9}
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
geom_point(size = 0.5, alpha = 0.5) +
facet_wrap(vars(Neoplasm_histologic_grade, HER2_status))
```
Faceting on two variables is usually better done using the other faceting
function, **`facet_grid()`**. Note the change in how the formula is written.
```{r scatter_plot_10}
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
geom_point(size = 0.5, alpha = 0.5) +
facet_grid(vars(Neoplasm_histologic_grade), vars(HER2_status))
```
Again we can use colour aesthetics alongside faceting to add further information
to our visualization.
```{r scatter_plot_11}
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1, colour = PAM50)) +
geom_point(size = 0.5, alpha = 0.5) +
facet_grid(rows = vars(Neoplasm_histologic_grade),
cols = vars(HER2_status))
```
Finally, we can use a `labeller` to change the labels for each of the
categorical values so that these are more meaningful in the context of this
plot.
```{r scatter_plot_12}
grade_labels <- c("1" = "Grade I", "2" = "Grade II", "3" = "Grade III")
her2_status_labels <- c("Positive" = "HER2 positive",
"Negative" = "HER2 negative")
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1, colour = PAM50)) +
geom_point(size = 0.5, alpha = 0.5) +
facet_grid(rows = vars(Neoplasm_histologic_grade),
cols = vars(HER2_status),
labeller = labeller(Neoplasm_histologic_grade = grade_labels,
HER2_status = her2_status_labels))
```
This would certainly be necessary if we were to use ER and HER2 status on one
side of the grid.
```{r scatter_plot_13}
er_status_labels <- c("Positive" = "ER positive", "Negative" = "ER negative")
#
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1, colour = PAM50)) +
geom_point(size = 0.5, alpha = 0.5) +
facet_grid(rows = vars(Neoplasm_histologic_grade),
cols = vars(ER_status, HER2_status),
labeller = labeller(Neoplasm_histologic_grade = grade_labels,
ER_status = er_status_labels,
HER2_status = her2_status_labels))
```
---
# Summary
In this session we have covered the following concepts:
* Building up workflows by chaining operations together using the pipe operator
* Sorting rows based on values in one or more columns
* Modifying a data frame by either adding new columns or modifying existing ones
* Summarizing the values in one or more columns
* Faceting of ggplot2 visualizations
---
# Exercises
These exercises will test your ability to manipulate the METABRIC data set
by changing the values of existing columns or adding new columns by computing
new variables from existing ones.
We are expecting you to use the 5 main dplyr 'verb' functions: `select()`,
`filter()`, `arrange()`, `mutate()` and `summarize()`. Please use the pipe
operator, `%>%`, in cases where more than one operation is required to achieve
the outcome requested.
```{r message = FALSE, warning = FALSE}
library(tidyverse)
metabric <- read_csv("data/metabric_clinical_and_expression_data.csv")
```
---
**1. Investigate the subset of long-surviving breast cancer patients that didn't receive chemo or radiotherapy**
First obtain the subset of patients that received neither chemotherapy or
radiotherapy and survived for more than 20 years.
<details><summary>Answer</summary>
```{r}
patients <- metabric %>%
mutate(Survival_time_years = Survival_time / 12) %>%
filter(Chemotherapy == "NO" &
Radiotherapy == "NO" &
Survival_time_years > 20)
patients
```
</details>
Now look at the breakdown of these patients in terms of ER status. Count the
numbers of ER positive and ER negative patients in this subset. Calculate the
proportion that are ER positive.
<details><summary>Answer</summary>
```{r}
table(patients$ER_status)
mean(patients$ER_status == "Positive")
```
</details>
What does this tell us? Calculate the proportion of ER positive patients in the
whole cohort by way of a comparison.
<details><summary>Answer</summary>
```{r}
table(metabric$ER_status)
mean(metabric$ER_status == "Positive")
```
</details>
**2. Which patients have spent the largest proportion of their lives dealing
with breast cancer? Note that the current Survival_time column is months.
Present the final results as a table showing Patient_ID, Age, Age_at_diagnosis,
Survival_time_in_years and Proportion_of_life_since_diagnosis. Round all
numbers in the final table to 2 decimal places and sort it so that the patient
with the highest proportion is at the top.**
<details><summary>Hint</summary>
1. Create a column of Survival_time_years
2. Add Survival_time_years it to Age_at_diagnosis to get final Age
3. Calculate the proportion: Survival_time_years/Age
</details>
<details><summary>Answer</summary>
```{r}
metabric %>%
mutate(Survival_time_years = Survival_time / 12) %>%
mutate(Age = Age_at_diagnosis + Survival_time_years) %>%
mutate(Proportion_of_life_since_diagnosis = Survival_time_years / Age) %>%
select(Patient_ID,
Age,
Age_at_diagnosis,
Survival_time_years,
Proportion_of_life_since_diagnosis) %>%
arrange(desc(Proportion_of_life_since_diagnosis)) %>%
mutate(across(where(is.numeric), ~round(.x, digits = 2)))
```
</details>