---
title: "Week 5 -- Data manipulation with dplyr"
---

> #### Learning objectives
>
> * Chain operations together into a workflow using pipes
> * Learn the key **dplyr** functions for manipulating your data
> * Learn about faceting in ggplot2 to split your data into separate categories
>   and create a series of sub-plots arranged in a grid

---

# Data manipulation with dplyr

**dplyr** is one of the packages that gets loaded as part of the tidyverse.

```{r}
library(tidyverse)
```

dplyr is the Swiss army knife in the tidyverse, providing many useful functions
for manipulating tabular data in data frames or tibbles. We're going to look at
the key functions for filtering our data, modifying the contents and computing
summary statistics.

We'll also introduce the pipe operator, **`%>%`**, for chaining operations
together into mini workflows in a way that makes for more readable and
maintainable code.

Finally, we'll return to plotting and look at a powerful feature of ggplot2,
**faceting**, that allows you to divide your plots into subplots by splitting
the observations based on one or more categorical variables.

We'll again use the METABRIC data set to illustrate how these operations work.

```{r message = FALSE}
metabric <- read_csv("data/metabric_clinical_and_expression_data.csv")
metabric
```

---


# dplyr verbs

The dplyr operations are commonly referred to as "verbs" in a data manipulation
grammar. These verbs have a common syntax and work together in a consistent and
uniform manner. They all have the following shared behaviours:

* The first argument in each function is a data frame (or tibble)

* Any additional arguments describe what operation to perform on the data frame

* Variable names, i.e. column names, are referred to without using quotes

* The result of an operation is a new data frame
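
These shared behaviours can be seen on a tiny made-up tibble. The `toy` data frame and its column names in the sketch below are invented for illustration and are not part of the METABRIC data.

```{r}
library(dplyr)

# A small made-up tibble (hypothetical values, not METABRIC data)
toy <- tibble(
    id = c("A", "B", "C", "D"),
    status = c("LIVING", "DECEASED", "LIVING", "LIVING"),
    months = c(130, 45, 90, 200)
)

# The first argument is the data frame; columns are referred to without quotes
long_survivors <- filter(toy, status == "LIVING", months > 120)

# The result is a new data frame; the original is left unchanged
long_survivors
nrow(toy)
```

Because `filter()` returns a new data frame rather than modifying `toy`, we have to assign the result to a name if we want to keep it.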

---

# Chaining operations using `%>%`

Let's consider again an earlier example in which we filtered the METABRIC data
set to retain just the patients who were still alive at the time of the study
and had survived for more than 10 years (120 months). We use `filter()` to
select the rows corresponding to the patients meeting these criteria and can
then use `select()` to only display the variables (columns) we're most
interested in.

```{r}
patients <- filter(metabric,
                   Survival_status == "LIVING" & Survival_time > 120)
patient_details <- select(patients,
                          Patient_ID,
                          Survival_time,
                          Tumour_stage,
                          Nottingham_prognostic_index)
patient_details
```

Here we've used an intermediate variable, `patients`, which we only
needed in order to get to the final result. We could just have used the same
name to avoid cluttering our environment and overwritten the results from the
`filter()` operation with those of the `select()` operation.

```{r eval = FALSE}
patients <- select(patients,
                   Patient_ID,
                   Survival_time,
                   Tumour_stage,
                   Nottingham_prognostic_index)
```

In order to avoid creating intermediate objects, another less readable way of
writing this code is to nest the `filter()` function call inside the
`select()`. Nested function calls are very common in a lot of R code you may
come across. For example, imagine we have a vector of numbers and we want to
find the square root of the mean. We could do it in two steps:

```{r}
x <- c(24, 53, 352, 643, 46)
myMean <- mean(x)
sqrt(myMean)
```

Or we can nest the commands inside each other:

```{r}
sqrt(mean(x))
```

This is convenient for simple commands, but is unwieldy and not easy to read
with more complex commands:

```{r}
patients <- select(filter(metabric, Survival_status == "LIVING", Survival_time > 120), Patient_ID, Survival_time, Tumour_stage, Nottingham_prognostic_index)
nrow(patients)
```

However, there is another way of chaining together a series of operations into a
mini workflow that is elegant, intuitive and makes for very readable R code.
For that we need to introduce a new operator, the **pipe** operator, **`%>%`**.

```{block type = "rmdblock"}
**The pipe operator `%>%`**

The pipe operator takes the output from one operation, i.e. whatever is on the
left-hand side of `%>%` and passes it in as the first argument to the second
operation, or function, on the right-hand side.

**`x %>% f(y)`** is equivalent to **`f(x, y)`**

For example:

`select(starwars, name, height, mass)`

can be rewritten as

`starwars %>% select(name, height, mass)`

This allows for chaining of operations into workflows, e.g.

`starwars %>%`<br/>
&nbsp;&nbsp;&nbsp;&nbsp;`filter(species == "Droid") %>%`<br/>
&nbsp;&nbsp;&nbsp;&nbsp;`select(name, height, mass)`

The `%>%` operator comes from the `magrittr` package and is available when we
load the tidyverse using `library(tidyverse)`.

Piping in R was motivated by the Unix pipe, `|`, in which the output from one
process is redirected to be the input for the next. This is so named because the
flow from one process or operation to the next resembles a pipeline.

NOTE: R now has a native pipe operator, `|>`, but we'll stick with the
**`%>%`** operator as it is more widely used and has a few advantages over the
native pipe.
```
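
As a quick check of the equivalence described in the box above, the following sketch runs the nested and the piped forms on the `starwars` data set that ships with dplyr and compares the results.

```{r}
library(dplyr)

# Nested calls: the data frame is passed explicitly as the first argument
nested <- select(filter(starwars, species == "Droid"), name, height, mass)

# Piped calls: each result becomes the first argument of the next function
piped <- starwars %>%
    filter(species == "Droid") %>%
    select(name, height, mass)

# Both forms produce the same data frame
identical(nested, piped)
```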

We can rewrite the code for our filtering and column selection operations as
follows.

```{r}
patients <- metabric %>%
    filter(Survival_status == "LIVING" & Survival_time > 120) %>%
    select(Patient_ID, Survival_time, Tumour_stage, Nottingham_prognostic_index)
```

Note how each operation takes the output from the previous operation as its
first argument. This way of coding is embraced wholeheartedly in the tidyverse,
hence almost every tidyverse function that works on data frames has the data
frame as its first argument. It is also the reason why tidyverse functions
return a data frame regardless of whether the output could be recast as a vector
or a single value.

"Piping", the act of chaining operations together, becomes really useful when
there are several steps involved in filtering and transforming a data set.

The usual way of developing a workflow is to build it up one step at a time,
testing the output produced at each stage. Let's do that for this case.

We start by considering just the patients who are living.

```{r}
patients <- metabric %>%
    filter(Survival_status == "LIVING")
```

We then add another filter for the survival time.

```{r}
patients <- metabric %>%
    filter(Survival_status == "LIVING") %>%
    filter(Survival_time > 120)
```

At each stage we look at the resulting `patients` data frame to check
we're happy with the result.

Finally we only want certain columns, so we add a `select()` operation.

```{r}
patients <- metabric %>%
    filter(Survival_status == "LIVING") %>%
    filter(Survival_time > 120) %>%
    select(Patient_ID, Survival_time, Tumour_stage, Nottingham_prognostic_index)
# print out the result
patients
```

When continuing our workflow across multiple lines, we need to be careful to
ensure the `%>%` is at the end of the line. If we try to place this at the start
of the next line, R will think we've finished the workflow prematurely and will
report an error at what it considers the next statement, i.e. the line that
begins with `%>%`.

```
# R considers the following to be 2 separate commands, the first of which ends
# with the first filter operation and runs successfully.
# The second command, on the third and last line, is not valid R code and so
# you'll get an error message
patients <- metabric %>%
    filter(Survival_status == "LIVING")
    %>% filter(Survival_time > 120)
```

This is very similar to what we saw with adding layers and other components to
a ggplot using the `+` operator, where we needed the `+` to be at the end of a
line.

We'll be using the pipe `%>%` operator throughout the rest of the course so
you'd better get used to it. But actually we think you'll come to love it as
much as we do.

---

# Modifying data using `mutate()`

In one of the examples of filtering observations using `filter()` above, we
created a new logical variable called `Deceased`.

```{r}
metabric$Deceased <- metabric$Survival_status == "DECEASED"
```

This is an example of a rather common type of data manipulation in which we
create a new column based on the values contained in one or more other columns.
The `dplyr` package provides the `mutate()` function for this purpose.

```{r}
metabric <- mutate(metabric, Deceased = Survival_status == "DECEASED")
```

Both of these methods add the new column to the end.

Note that variable names in the `mutate()` function call do not need to be
prefixed by `metabric$` just as they don't in any of the `dplyr` functions.

We can overwrite a column and this is quite commonly done when we want to change
the units. For example, our tumour size values are in millimetres but the
formula for the Nottingham prognostic index needs the tumour size to be in
centimetres. We can convert `Tumour_size` to centimetres by dividing by 10.

```{r}
metabric <- mutate(metabric, Tumour_size = Tumour_size / 10)
```
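
Both uses of `mutate()`, adding a column and overwriting one, can be tried out on a small made-up tibble. The `sizes` tibble and its values below are hypothetical and only there to make the effect easy to see.

```{r}
library(dplyr)

# Made-up measurements in millimetres (hypothetical values)
sizes <- tibble(sample = c("s1", "s2", "s3"), size = c(10, 25, 40))

converted <- sizes %>%
    mutate(large = size > 20) %>%   # adds a new logical column at the end
    mutate(size = size / 10)        # overwrites size, now in centimetres

converted
```

Note that the `large` column was computed before the conversion, so the threshold of 20 applies to the original values in millimetres.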

```{block type = "rmdblock"}
**Nottingham Prognostic Index**

The Nottingham prognostic index (**NPI**) is used to determine prognosis
following surgery for breast cancer. Its value is calculated using three
pathological criteria: the size of the tumour, the number of lymph nodes
involved, and the grade of the tumour.

The formula for the Nottingham prognostic index is:

**`NPI = (0.2 * S) + N + G`**

where

* **S** is the size of the tumour in centimetres
* **N** is the node status (0 nodes = 1, 1-3 nodes = 2, >3 nodes = 3)
* **G** is the grade of tumour (Grade I = 1, Grade II = 2, Grade III = 3)
```
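
As a worked example of the formula, consider a hypothetical patient with a 25 mm tumour, 2 involved lymph nodes (so N = 2) and a Grade III tumour (G = 3). The values are invented purely to illustrate the arithmetic: 0.2 × 2.5 + 2 + 3 = 5.5.

```{r}
# Hypothetical patient values, chosen only to illustrate the formula
size_mm <- 25        # tumour size in millimetres
node_status <- 2     # 1-3 nodes involved
grade <- 3           # Grade III

size_cm <- size_mm / 10                        # convert to centimetres
npi <- (0.2 * size_cm) + node_status + grade
npi
```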

We'll recalculate the Nottingham prognostic index using the values in the
`Tumour_size`, `Neoplasm_histologic_grade` and `Lymph_node_status` columns of
our METABRIC data frame.

```{r}
metabric <- metabric %>%
    mutate(NPI = (0.2 * Tumour_size) +
               Lymph_node_status +
               Neoplasm_histologic_grade)

select(metabric,
       Tumour_size,
       Lymph_node_status,
       Neoplasm_histologic_grade,
       NPI,
       Nottingham_prognostic_index)
```

There is a discrepancy. Can you see what the problem is?

It appears that the tumour size wasn't correctly converted into centimetres in
the original NPI calculation or that the wrong scaling factor for tumour size
was used.

```{r scatter_plot_1}
ggplot(data = metabric) +
    geom_point(mapping = aes(x = Age_at_diagnosis, y = NPI), na.rm = TRUE)
```

The **`round()`** function is really useful for rounding numerical values to a
specified number of decimal places. We'll read in the METABRIC data again and
create a small workflow that carries out the tumour size conversion, computes
the NPI, rounds the tumour size and the resulting NPI value to 1 decimal place
and displays the results in decreasing order of NPI.

```{r message = FALSE}
read_csv("data/metabric_clinical_and_expression_data.csv") %>%
    mutate(Tumour_size = Tumour_size / 10) %>%
    mutate(NPI = (0.2 * Tumour_size) +
               Lymph_node_status +
               Neoplasm_histologic_grade) %>%
    mutate(Tumour_size = round(Tumour_size, digits = 1)) %>%
    mutate(NPI = round(NPI, digits = 1)) %>%
    arrange(desc(NPI)) %>%
    select(Tumour_size, Lymph_node_status, Neoplasm_histologic_grade, NPI)
```

## Mutating multiple columns

In that last workflow we included the same rounding operation applied to two
different variables. It would be nice to be able to carry out just the one
`mutate()` but apply it to both `Tumour_size` and `NPI` columns and we can,
using **`across()`**.

`across` is used inside `mutate` and takes two arguments, first a selection of
columns to operate across and second a function to apply to those columns.

So if we just wanted to round to the nearest integer:

```{r}
metabric %>%
    mutate(across(c(Tumour_size, NPI), round)) %>%
    select(Patient_ID, Tumour_size, NPI)
```

This applies the function `round` across all columns specified in the first
argument for `across`.

This is nice and simple but things can get a little more complex if we want more
than just a straightforward function.

We may come across situations where we'd like to apply the same operation to
multiple columns but where there is no available function in R to do what we
want, or situations where we need to pass extra arguments to our function.

In these cases the syntax gets a little trickier. We need to write what is known
as an "anonymous function".

Let's say we want to convert the petal and sepal measurements in the `iris`
data set from centimetres to millimetres. There is no 'multiply by 10' function
and it seems a bit pointless to create one just for this conversion, so we'll
use an anonymous function instead -- anonymous because it has no name, it's an
_in situ_ function only used in our `mutate()` function call.

```{r}
iris %>%
    as_tibble() %>%
    mutate(across(Sepal.Length:Petal.Width, ~ .x * 10))
```

The **`~`** denotes that we're using an anonymous function (it is the symbol for
formulae in R) and the `.x` is a placeholder for the column being operated on. In
this case, we're multiplying each of the columns between `Sepal.Length` and
`Petal.Width` inclusive by 10.
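
To make the shorthand less mysterious, the sketch below writes the same conversion twice, once with the `~ .x * 10` formula and once with an ordinary unnamed `function(x)`, and checks that the two give the same result. The object names `with_formula` and `with_function` are just for this illustration.

```{r}
library(dplyr)

# Formula shorthand, as used above
with_formula <- iris %>%
    as_tibble() %>%
    mutate(across(Sepal.Length:Petal.Width, ~ .x * 10))

# The same operation written as an ordinary function with no name
with_function <- iris %>%
    as_tibble() %>%
    mutate(across(Sepal.Length:Petal.Width, function(x) x * 10))

identical(with_formula, with_function)
```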

If we wish to use an existing function but provide additional arguments, then
the syntax is:

`~func(.x, ...)`

where `.x` stands for each of the selected columns in turn and `...` stands for
any extra arguments we need to pass to the function.

So if we want to round to 1 decimal place:

```{r}
metabric %>%
    mutate(across(c(Tumour_size, NPI), ~round(.x, digits = 1))) %>%
    select(Patient_ID, Tumour_size, NPI)
```

We can use the range operator and the same helper functions as we did for
selecting columns using `select()` inside `across()`.

For example, we might decide that our expression values are given to a much
higher degree of precision than is strictly necessary.

```{r}
metabric %>%
    mutate(across(ESR1:MLPH, ~round(.x, digits = 2))) %>%
    select(Patient_ID, ESR1:MLPH)
```

Or we could decide that all the columns whose names end with "\_status" are in
fact categorical variables and should be converted to factors.

```{r}
metabric %>%
    mutate(across(ends_with("_status"), as.factor)) %>%
    select(Patient_ID, ends_with("_status"))
```

---

# Computing summary values using `summarise()`

We'll cover the fifth of the main dplyr 'verb' functions, **`summarise()`**,
only briefly here. This function computes summary values for one or more
variables (columns) in a table. Here we will summarise values for the entire
table but this function is much more useful in combination with `group_by()`
in working on groups of observations within the data set. We will look at
summarizing groups of observations next week.

Any function that calculates a single scalar value from a vector can be used
with `summarise()`. For example, the `mean()` function calculates the arithmetic
mean of a numeric vector. Let's calculate the average ESR1 expression for tumour
samples in the METABRIC data set.

```{r}
mean(metabric$ESR1)
```

The equivalent operation using `summarise()` is:

```{r}
summarise(metabric, mean(ESR1))
```

If you prefer Oxford spelling, in which -ize is preferred to -ise, you’re in
luck as dplyr accommodates the alternative spelling, `summarize()`.

Both of the above statements gave the same average expression value but these
were output in differing formats. The `mean()` function collapses a vector to a
single scalar value, which as we know is in fact a vector of length 1. The
`summarise()` function, as with most tidyverse functions, returns another data
frame, albeit one in which there is a single row and a single column.
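
The difference in return type is easy to inspect on a small made-up tibble. The `toy` tibble and its `expression` column below are hypothetical values used only for this sketch.

```{r}
library(dplyr)

# Made-up expression values (hypothetical)
toy <- tibble(expression = c(2, 4, 6, 8))

as_vector <- mean(toy$expression)              # numeric vector of length 1
as_table <- summarise(toy, mean(expression))   # tibble with 1 row and 1 column

class(as_vector)
dim(as_table)

# summarize() is the alternative spelling and gives the same result
identical(as_table, summarize(toy, mean(expression)))
```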

Returning a data frame might be quite useful, particularly if we’re summarising
multiple columns or using more than one function, for example computing the
average and standard deviation.

```{r}
summarise(metabric, ESR1_mean = mean(ESR1), ESR1_sd = sd(ESR1))
```

Notice how we also named the output columns in this last example.

```{block type = "rmdblock"}
**`summarise()`**

`summarise()` collapses a data frame into a single row by calculating summary
values of one or more of the columns.

It can take any function that takes a vector of values and returns a single
value. Some of the more useful functions include:

* Centre: **`mean()`**, **`median()`**
* Spread: **`sd()`**, **`mad()`**
* Range: **`min()`**, **`max()`**, **`quantile()`**
* Position: **`first()`**, **`last()`**
* Count: **`n()`**

Note that `first()`, `last()` and `n()` are only really useful when working on
groups of observations using **`group_by()`**.

**`n()`** is a special function that returns the number of observations; it
doesn't take a vector argument, i.e. a column.
```
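
Several of the functions listed above can be combined in a single `summarise()` call. The sketch below does this for a small made-up tibble; the `toy` tibble and the output column names are invented for illustration.

```{r}
library(dplyr)

# Made-up values (hypothetical, for illustration only)
toy <- tibble(value = c(2, 4, 4, 10))

stats <- summarise(toy,
                   centre_mean = mean(value),
                   centre_median = median(value),
                   spread_sd = sd(value),
                   lowest = min(value),
                   highest = max(value),
                   count = n())   # n() takes no column argument
stats
```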

It is also possible to summarise using a function that takes more than one
vector as input, i.e. values from multiple columns. For example, we could compute
the correlation between the expression of FOXA1 and MLPH.

```{r}
summarise(metabric, correlation = cor(FOXA1, MLPH))
```

## Summarizing multiple columns

As with `mutate()` we can use **`across()`** to summarise across multiple columns
using the same function.

```{r}
summarise(metabric, across(c(FOXA1, MLPH), mean))
```

We can use the `everything()` selector to summarise across all the columns in
the table.

```{r}
metabric %>%
    select(ESR1:MLPH) %>%
    summarise(across(everything(), mean))
```

But, you have to be careful that all columns can be summarised with the given
summary function. For example, what happens if we try to compute an average of
a set of character values?

```{r}
metabric %>%
    summarise(across(everything(), ~mean(.x, na.rm = TRUE)))
```

We get a lot of warning messages and `NA` values for those columns for which
computing an average does not make sense.

We can combine `across()` with `where()` and a logical test (e.g. `is.numeric`)
to select those columns for which a summarization function is appropriate.

```{r}
metabric %>%
    summarise(across(where(is.numeric), ~mean(.x, na.rm = TRUE)))
```

It is possible to summarise using more than one function in which case a list of
functions needs to be provided.

```{r}
metabric %>%
    summarise(across(c(ESR1, ERBB2, PGR), list(mean, sd)))
```

Pretty neat but I'm not sure about those column headings in the output --
fortunately we have some control over these.

```{r}
metabric %>%
    summarise(across(c(ESR1, ERBB2, PGR), list(Average = mean, stdev = sd)))
```
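
The naming pattern is easier to see on a small made-up tibble. With a named list of functions, `across()` joins the column name and the function name with an underscore. The `toy` tibble and its columns `a` and `b` are invented for this sketch.

```{r}
library(dplyr)

# Made-up values (hypothetical)
toy <- tibble(a = c(1, 2, 3), b = c(10, 20, 30))

named <- toy %>%
    summarise(across(c(a, b), list(Average = mean, stdev = sd)))

# Output columns are named <column>_<function name>
names(named)
```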

Anonymous functions can also be used with `summarise()`.

For example, suppose we want to compute the correlation of the
expression for FOXA1 against all other genes to see which was most strongly
correlated. Here is how we could do this in a single `summarise()` statement
using an anonymous function.

```{r}
metabric %>%
    summarise(across(c(ESR1:MLPH, -FOXA1), ~cor(.x, FOXA1)))
```

Notice how we selected all genes between ESR1, the first gene column in our
data frame, and MLPH, the last gene column, but then excluded FOXA1 as we're not
all that interested in the correlation of FOXA1 with itself (we know the answer
is 1).

---

# Faceting with ggplot2

Finally, let's change track completely and take a look at a very useful feature
of ggplot2 -- **faceting**.

Faceting allows you to split your plot into subplots, or facets, based on one
or more categorical variables. Each of the subplots displays a subset of the
data.

There are two faceting functions, **`facet_wrap()`** and **`facet_grid()`**.

Let's create a scatter plot of GATA3 and ESR1 expression values where we're
displaying the PR positive and PR negative patients using different colours.
This is very similar to a plot we created last week.

```{r scatter_plot_2}
ggplot(data = metabric,
       mapping = aes(x = GATA3, y = ESR1, colour = PR_status)) +
    geom_point(size = 0.5, alpha = 0.5)
```

An alternative is to use faceting with **`facet_wrap()`**.

```{r scatter_plot_3}
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
    geom_point(size = 0.5, alpha = 0.5) +
    facet_wrap(vars(PR_status))
```

This produces two plots, side-by-side, one for each of the categories in the
`PR_status` variable, with a banner across the top of each for the category.

The variable(s) used for faceting are specified using `vars()`, in a similar way
to the selection of variables for `mutate_at()` and `summarise_at()`, older
dplyr functions that have been superseded by `across()`.

We can still use separate colours if we prefer things to be, well, colourful.

```{r scatter_plot_4}
ggplot(data = metabric,
       mapping = aes(x = GATA3, y = ESR1, colour = PR_status)) +
    geom_point(size = 0.5, alpha = 0.5) +
    facet_wrap(vars(PR_status))
```

Faceting is usually better than displaying groups using different colours when
there are more than two or three groups, as it can then be difficult to tell
which points belong to each group. A case in point is the 3-gene
classification in the GATA3 vs ESR1 scatter plot we created last week. Let's
create a faceted version of that plot.

```{r scatter_plot_5}
ggplot(data = metabric,
       mapping = aes(x = GATA3, y = ESR1, colour = `3-gene_classifier`)) +
    geom_point(size = 0.5, alpha = 0.5) +
    facet_wrap(vars(`3-gene_classifier`))
```

This helps explain why the function is called `facet_wrap()`. When it has too
many subplots to fit across the page, it wraps around to another row. We can
control how many rows or columns to use with the `nrow` and `ncol` arguments.

```{r scatter_plot_6}
ggplot(data = metabric,
       mapping = aes(x = GATA3, y = ESR1, colour = `3-gene_classifier`)) +
    geom_point(size = 0.5, alpha = 0.5) +
    facet_wrap(vars(`3-gene_classifier`), nrow = 1)
```

```{r scatter_plot_7}
ggplot(data = metabric,
       mapping = aes(x = GATA3, y = ESR1, colour = `3-gene_classifier`)) +
    geom_point(size = 0.5, alpha = 0.5) +
    facet_wrap(vars(`3-gene_classifier`), ncol = 2)
```
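
The wrapping behaviour can also be tried without the METABRIC data. This sketch uses `mpg`, an example data set included with ggplot2, whose `drv` column has three categories; the choice of data set and variables is ours and purely illustrative.

```{r}
library(ggplot2)

# mpg is an example data set included with ggplot2; drv has three categories
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
    geom_point(size = 0.5, alpha = 0.5) +
    facet_wrap(vars(drv), ncol = 2)

# With three facets and two columns, the third facet wraps onto a second row
p
```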

We can combine faceting on one variable with a colour aesthetic for another
variable. For example, let's show the tumour grade using faceting and the HER2
status using colours.

```{r scatter_plot_8}
ggplot(data = metabric,
       mapping = aes(x = GATA3, y = ESR1, colour = HER2_status)) +
    geom_point(size = 0.5, alpha = 0.5) +
    facet_wrap(vars(Neoplasm_histologic_grade))
```

Instead of this we could facet on more than one variable.

```{r scatter_plot_9}
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
    geom_point(size = 0.5, alpha = 0.5) +
    facet_wrap(vars(Neoplasm_histologic_grade, HER2_status))
```

Faceting on two variables is usually better done using the other faceting
function, **`facet_grid()`**. Note that the row and column variables are given
as two separate `vars()` arguments.

```{r scatter_plot_10}
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
    geom_point(size = 0.5, alpha = 0.5) +
    facet_grid(vars(Neoplasm_histologic_grade), vars(HER2_status))
```

Again we can use colour aesthetics alongside faceting to add further information
to our visualization.

```{r scatter_plot_11}
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1, colour = PAM50)) +
    geom_point(size = 0.5, alpha = 0.5) +
    facet_grid(rows = vars(Neoplasm_histologic_grade),
               cols = vars(HER2_status))
```

Finally, we can use a `labeller` to change the labels for each of the
categorical values so that these are more meaningful in the context of this
plot.

```{r scatter_plot_12}
grade_labels <- c("1" = "Grade I", "2" = "Grade II", "3" = "Grade III")
her2_status_labels <- c("Positive" = "HER2 positive",
                        "Negative" = "HER2 negative")

ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1, colour = PAM50)) +
    geom_point(size = 0.5, alpha = 0.5) +
    facet_grid(rows = vars(Neoplasm_histologic_grade),
               cols = vars(HER2_status),
               labeller = labeller(Neoplasm_histologic_grade = grade_labels,
                                   HER2_status = her2_status_labels))
```

This would certainly be necessary if we were to use ER and HER2 status on one
side of the grid.

```{r scatter_plot_13}
er_status_labels <- c("Positive" = "ER positive", "Negative" = "ER negative")

ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1, colour = PAM50)) +
    geom_point(size = 0.5, alpha = 0.5) +
    facet_grid(rows = vars(Neoplasm_histologic_grade),
               cols = vars(ER_status, HER2_status),
               labeller = labeller(Neoplasm_histologic_grade = grade_labels,
                                   ER_status = er_status_labels,
                                   HER2_status = her2_status_labels))
```
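
The label lookups passed to `labeller()` are ordinary named character vectors: the names are the values found in the data and the elements are the text to display. The sketch below shows the same pattern on the built-in `mtcars` data set; the label text and object names are our own choices for illustration.

```{r}
library(ggplot2)

# Names are the values found in the data; elements are the labels to display
cyl_labels <- c("4" = "4 cylinders", "6" = "6 cylinders", "8" = "8 cylinders")
am_labels <- c("0" = "Automatic", "1" = "Manual")

p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
    geom_point() +
    facet_grid(rows = vars(cyl),
               cols = vars(am),
               labeller = labeller(cyl = cyl_labels, am = am_labels))
p
```

Numeric values such as `4` or `0` are matched against the names as text, which is why the names are written in quotes.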

---

# Summary

In this session we have covered the following concepts:

* Building up workflows by chaining operations together using the pipe operator
* Sorting rows based on values in one or more columns
* Modifying a data frame by either adding new columns or modifying existing ones
* Summarizing the values in one or more columns
* Faceting of ggplot2 visualizations

---

# Exercises

These exercises will test your ability to manipulate the METABRIC data set
by changing the values of existing columns or adding new columns by computing
new variables from existing ones.

We are expecting you to use the 5 main dplyr 'verb' functions: `select()`,
`filter()`, `arrange()`, `mutate()` and `summarize()`. Please use the pipe
operator, `%>%`, in cases where more than one operation is required to achieve
the outcome requested.

```{r message = FALSE, warning = FALSE}
library(tidyverse)
metabric <- read_csv("data/metabric_clinical_and_expression_data.csv")
```

---

**1. Investigate the subset of long-surviving breast cancer patients that didn't receive chemo or radiotherapy**

First obtain the subset of patients that received neither chemotherapy nor
radiotherapy and survived for more than 20 years.

<details><summary>Answer</summary>
```{r}
patients <- metabric %>%
    mutate(Survival_time_years = Survival_time / 12) %>%
    filter(Chemotherapy == "NO" &
               Radiotherapy == "NO" &
               Survival_time_years > 20)

patients
```
</details>

Now look at the breakdown of these patients in terms of ER status. Count the
numbers of ER positive and ER negative patients in this subset. Calculate the
proportion that are ER positive.

<details><summary>Answer</summary>
```{r}
table(patients$ER_status)
mean(patients$ER_status == "Positive")
```
</details>

What does this tell us? Calculate the proportion of ER positive patients in the
whole cohort by way of a comparison.

<details><summary>Answer</summary>
```{r}
table(metabric$ER_status)
mean(metabric$ER_status == "Positive")
```
</details>


**2. Which patients have spent the largest proportion of their lives dealing
with breast cancer? Note that the current Survival_time column is in months.
Present the final results as a table showing Patient_ID, Age, Age_at_diagnosis,
Survival_time_years and Proportion_of_life_since_diagnosis. Round all
numbers in the final table to 2 decimal places and sort it so that the patient
with the highest proportion is at the top.**

<details><summary>Hint</summary>
1. Create a column of Survival_time_years
2. Add Survival_time_years to Age_at_diagnosis to get final Age
3. Calculate the proportion: Survival_time_years/Age
</details>

<details><summary>Answer</summary>
```{r}
metabric %>%
    mutate(Survival_time_years = Survival_time / 12) %>%
    mutate(Age = Age_at_diagnosis + Survival_time_years) %>%
    mutate(Proportion_of_life_since_diagnosis = Survival_time_years / Age) %>%
    select(Patient_ID,
           Age,
           Age_at_diagnosis,
           Survival_time_years,
           Proportion_of_life_since_diagnosis) %>%
    arrange(desc(Proportion_of_life_since_diagnosis)) %>%
    mutate(across(where(is.numeric), ~round(.x, digits = 2)))
```
</details>