# Logistic regression{#chap09-h1}
\index{logistic regression@\textbf{logistic regression}}
> All generalizations are false, including this one.
> Mark Twain
```{r echo=FALSE, message=FALSE}
library(knitr)
library(kableExtra)
library(ggplot2)
mykable = function(x, caption = "CAPTION", ...){
kable(x, row.names = FALSE, align = c("l", "l", "r", "r", "r", "r", "r", "r", "r"),
booktabs = TRUE, caption = caption,
linesep = c("", "", "\\addlinespace"), ...) %>%
kable_styling(latex_options = c("scale_down", "hold_position"))
}
theme_set(theme_bw())
```
## Generalised linear modelling
Do not start here!
The material covered in this chapter is best understood after having read linear regression (Chapter \@ref(chap07-h1)) and working with categorical outcome variables (Chapter \@ref(chap08-h1)).
Generalised linear modelling is an extension to the linear modelling we are now familiar with.
It allows the principles of linear regression to be applied when outcomes are not continuous numeric variables.
## Binary logistic regression
\index{logistic regression@\textbf{logistic regression}!binary data}
\index{binary data}
A regression analysis is a statistical approach to estimating the relationships between variables, often by drawing straight lines through data points.
For instance, we may try to predict blood pressure in a group of patients based on their coffee consumption (Figure \@ref(fig:chap07-fig-regression) from Chapter \@ref(chap07-h1)).
As blood pressure and coffee consumption can be considered on a continuous scale, this is an example of simple linear regression.
Logistic regression is an extension of this, where the variable being predicted is *categorical*.
We will focus on binary logistic regression, where the dependent variable has two levels, e.g., yes or no, 0 or 1, dead or alive.
Other types of logistic regression include 'ordinal', when the outcome variable has >2 ordered levels, and 'multinomial', where the outcome variable has >2 levels with no inherent order.
We will only deal with binary logistic regression.
When we use the term 'logistic regression', that is what we are referring to.
We have good reason.
In healthcare we are often interested in an event (like death) occurring or not occurring.
Binary logistic regression can tell us the probability of this outcome occurring in a patient with a particular set of characteristics.
Although in binary logistic regression the outcome must have two levels, remember that the predictors (explanatory variables) can be either continuous or categorical.
<!-- Consider making all "Questions" specific in title, e.g. Coffee and CV event etc -->
### The Question (1)
As in previous chapters, we will use concrete examples when discussing the principles of the approach.
We return to our example of coffee drinking.
Yes, we are a little obsessed with coffee.
Our outcome variable was previously blood pressure.
We will now consider our outcome as the occurrence of a cardiovascular (CV) event over a 10-year period.
A cardiovascular event includes the diagnosis of ischemic heart disease, a heart attack (myocardial infarction), or a stroke (cerebrovascular accident).
The diagnosis of a cardiovascular event is clearly a binary condition: it either happens or it does not.
This is ideal for modelling using binary logistic regression.
But remember, the data are completely simulated and not based on anything in the real world.
This bit is just for fun!
### Odds and probabilities
\index{logistic regression@\textbf{logistic regression}!odds and probabilities}
To understand logistic regression we need to remind ourselves about odds and probability.
Odds and probabilities can get confusing so get them straight with Figure \@ref(fig:chap09-fig-odds).
```{r chap09-fig-odds, echo = FALSE, fig.cap="Probability vs odds."}
knitr::include_graphics("images/chapter09/0_odds.png", auto_pdf = TRUE)
```
In many situations, there is no particular reason to prefer one to the other.
However, humans seem to have a preference for expressing chance in terms of probabilities, while odds have particular mathematical properties that make them useful in regression.
When a probability is 0, the odds are 0.
When a probability is between 0 and 0.5, the odds are less than 1.0 (i.e., less than "1 to 1").
As probability increases from 0.5 to 1.0, the odds increase from 1.0 to approach infinity.
Thus the range of probability is 0 to 1 and the range of odds is 0 to $+\infty$.
Odds and probabilities can easily be interconverted.
For example, if the odds of a patient dying from a disease are 1/3 (in horse racing this is stated as '3 to 1 against'), then the probability of death (also known as risk) is 0.25 (or 25%).
Odds of `1 to 1` equal a probability of 50%.
$Odds = \frac{p}{1-p}$, where $p$ is the probability of the outcome occurring.
$Probability = \frac{odds}{odds+1}$.
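As a quick check, the example above can be reproduced with a couple of lines of R arithmetic:
```{r}
p <- 0.25            # probability of death from the example above
odds <- p / (1 - p)  # odds of 1/3 ('3 to 1 against')
odds
odds / (odds + 1)    # and back to a probability of 0.25
```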
### Odds ratios
\index{logistic regression@\textbf{logistic regression}!odds ratio}
\index{odds ratio}
Another important term to remind ourselves of is the 'odds ratio'.
Why?
Because in a logistic regression the slopes of fitted lines (coefficients) can be interpreted as odds ratios.
This is very useful when interpreting the association of a particular predictor with an outcome.
For a given categorical predictor such as smoking, the difference in chance of the outcome occurring for smokers vs non-smokers can be expressed as a ratio of odds or odds ratio (Figure \@ref(fig:chap09-fig-or)).
For example, if the odds of a smoker having a CV event are 1.5 and the odds of a non-smoker are 1.0, then the odds of a smoker having an event are 1.5-times greater than a non-smoker, odds ratio = 1.5.
```{r chap09-fig-or, echo = FALSE, fig.cap="Odds ratios."}
knitr::include_graphics("images/chapter09/1_or.png", auto_pdf = TRUE)
```
An alternative is a ratio of probabilities which is called a risk ratio or relative risk.
We will continue to work with odds ratios given they are an important expression of effect size in logistic regression analysis.
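To make the distinction concrete, here is a small worked calculation using the illustrative smoker and non-smoker odds from above (simple arithmetic, not output from the simulated dataset):
```{r}
odds_smoker     <- 1.5
odds_non_smoker <- 1.0
odds_smoker / odds_non_smoker   # odds ratio = 1.5

# Convert each odds to a probability, then take the ratio of probabilities
p_smoker     <- odds_smoker / (odds_smoker + 1)          # 0.6
p_non_smoker <- odds_non_smoker / (odds_non_smoker + 1)  # 0.5
p_smoker / p_non_smoker         # risk ratio (relative risk) = 1.2
```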
### Fitting a regression line
\index{logistic regression@\textbf{logistic regression}!fitted line}
Let's return to the task at hand.
The difficulty in moving from a continuous to a binary outcome variable quickly becomes obvious.
If our $y$-axis only has two values, say 0 and 1, then how can we fit a line through our data points?
An assumption of linear regression is that the dependent variable is continuous, unbounded, and measured on an interval or ratio scale.
Unfortunately, binary dependent variables fulfil none of these requirements.
The answer is what makes logistic regression so useful.
Rather than estimating $y=0$ or $y=1$ from the $x$-axis, we estimate the *probability* of $y=1$.
There is one more difficulty in this though.
Probabilities can only exist for values of 0 to 1.
The probability scale is therefore not linear - straight lines do not make sense on it.
As we saw above, the odds scale runs from 0 to $+\infty$.
But here, probabilities from 0 to 0.5 are squashed into odds of 0 to 1, and probabilities from 0.5 to 1 have the expansive comfort of 1 to $+\infty$.
This is why we fit binary data on a *log-odds scale*.
A log-odds scale sounds incredibly off-putting to non-mathematicians, but it is the perfect solution.
* Log-odds run from $-\infty$ to $+\infty$;
* odds of 1 become log-odds of 0;
* a doubling and a halving of odds represent the same distance on the scale.
```{r}
log(1)
log(2)
log(0.5)
```
I'm sure some are shouting 'obviously' at the page.
That is good!
This is wrapped up in a transformation (a bit like the transformations shown in Section \@ref(chap06-transform)) using the so-called logit function.
This can be skipped with no loss of understanding, but for those who just-gots-to-see, the logit function is,
$\log_e (\frac{p}{1-p})$, where $p$ is the probability of the outcome occurring.
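In base R, the logit transformation and its inverse are available as `qlogis()` and `plogis()`:
```{r}
qlogis(0.5)    # a probability of 0.5 is a log-odds of 0 (odds of 1)
qlogis(0.25)   # probabilities below 0.5 give negative log-odds
plogis(0)      # the inverse: log-odds of 0 back to a probability of 0.5
```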
Figure \@ref(fig:chap09-fig-logodds) demonstrates the fitted lines from a logistic regression model of cardiovascular event by coffee consumption, stratified by smoking on the log-odds scale (A) and the probability scale (B).
We could conclude, for instance, that on average, non-smokers who drink 2 cups of coffee per day have a 50% chance of a cardiovascular event.
```{r chap09-fig-logodds, echo = FALSE, fig.cap="A logistic regression model of life-time cardiovascular event occurrence by coffee consumption stratified by smoking (simulated data). Fitted lines plotted on the log-odds scale (A) and probability scale (B). *lines are straight when no polynomials or splines are included in regression. "}
knitr::include_graphics("images/chapter09/2_prob_logodds.png", auto_pdf = TRUE)
```
### The fitted line and the logistic regression equation
Figure \@ref(fig:chap09-fig-equation) links the logistic regression equation, the appearance of the fitted lines on the probability scale, and the output from a standard base `R` analysis.
The dots at the top and bottom of the plot represent whether individual patients have had an event or not. The fitted line, therefore, represents the point-to-point probability of a patient with a particular set of characteristics having the event or not.
Compare this to Figure \@ref(fig:chap07-fig-equation) to be clear on the difference.
The slope of the line is linear on the log-odds scale and these are presented in the output on the log-odds scale.
Thankfully, it is straightforward to convert these to odds ratios, a measure we can use to communicate effect size and direction effectively.
Said in more technical language, the exponential of the coefficient on the log-odds scale can be interpreted as an odds ratio.
For a continuous variable such as cups of coffee consumed, the odds ratio is the change in odds of a CV event associated with a 1 cup increase in coffee consumption.
We are dealing with linear responses here, so the odds ratio is the same for an increase from 1 to 2 cups, or 3 to 4 cups etc.
Remember that if the odds ratio for 1 unit of change is 1.5, then the odds ratio for 2 units of change is $exp(log(1.5)*2) = 2.25$.
For a categorical variable such as smoking, the odds ratio is the change in odds of a CV event associated with smoking compared with not smoking (the reference level).
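Returning to the continuous example, this scaling of an odds ratio is easily checked in R:
```{r}
exp(log(1.5) * 2)  # odds ratio for a 2-unit change
1.5^2              # equivalently, squaring the 1-unit odds ratio
```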
```{r chap09-fig-equation, echo = FALSE, fig.cap="Linking the logistic regression fitted line and equation (A) with the R output (B)."}
knitr::include_graphics("images/chapter09/4_equation.png", auto_pdf = TRUE)
```
### Effect modification and confounding
\index{logistic regression@\textbf{logistic regression}!effect modification}
\index{logistic regression@\textbf{logistic regression}!confounding}
\index{logistic regression@\textbf{logistic regression}!interactions}
As with all multivariable regression models, logistic regression allows the incorporation of multiple variables which all may have direct effects on outcome or may confound the effect of another variable.
This was explored in detail in Section \@ref(chap07-confound); all of the same principles apply.
Adjusting for effect modification and confounding allows us to isolate the direct effect of an explanatory variable of interest upon an outcome.
In our example, we are interested in direct effect of coffee drinking on the occurrence of cardiovascular disease, independent of any association between coffee drinking and smoking.
Figure \@ref(fig:chap09-fig-types) demonstrates simple, additive and multiplicative models.
Think back to Figure \@ref(fig:chap07-fig-types) and the discussion around it as these terms are easier to think about when looking at the linear regression example, but essentially they work the same way in logistic regression.
Presented on the probability scale, the effect of the interaction is difficult to see.
It is obvious on the log-odds scale that the fitted lines are no longer constrained to be parallel.
```{r chap09-fig-types, echo = FALSE, fig.cap="Multivariable logistic regression (A) with additive (B) and multiplicative (C) effect modification."}
knitr::include_graphics("images/chapter09/6_types.png", auto_pdf = TRUE)
```
The interpretation of the interaction term is important.
The exponential of the interaction coefficient term represents a 'ratio-of-odds ratios'.
This is easiest to see through a worked example.
In Figure \@ref(fig:chap09-fig-interaction) the effect of coffee on the odds of a cardiovascular event can be compared in smokers and non-smokers.
The effect is now different given the inclusion of a significant interaction term.
Please check back to the linear regression chapter if this is not making sense.
```{r chap09-fig-interaction, echo = FALSE, fig.cap="Multivariable logistic regression with interaction term. The exponential of the interaction term is a ratio-of-odds ratios (ROR)."}
knitr::include_graphics("images/chapter09/7_interactions.png", auto_pdf = TRUE)
```
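To show what this looks like in code, the sketch below fits a multiplicative interaction with `glm()`; the data frame and variable names (`coffee_data`, `cv_event`, `cups`, `smoking`) are hypothetical stand-ins for the simulated coffee data, which are not provided with this book:
```{r eval=FALSE}
# Sketch only (not run): `coffee_data` and its variables are hypothetical.
fit_int <- glm(cv_event ~ cups * smoking,
               data = coffee_data, family = binomial)

# The exponential of the interaction coefficient is the ratio-of-odds ratios:
# how much the per-cup odds ratio differs between smokers and non-smokers.
exp(coef(fit_int))
```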
## Data preparation and exploratory analysis
### The Question (2)
We will go on to explore the `boot::melanoma` dataset introduced in Chapter \@ref(chap08-h1).
The data consist of measurements made on patients after surgery to remove the melanoma skin cancer in the University Hospital of Odense, Denmark, between 1962 and 1977.
Malignant melanoma is an aggressive and highly invasive cancer, making it difficult to treat.
To determine how advanced it is, staging is based on the depth of the tumour.
The current TNM classification cut-offs are:
- T1: $\leq$ 1.0 mm depth
- T2: 1.1 to 2.0 mm depth
- T3: 2.1 to 4.0 mm depth
- T4: > 4.0 mm depth
This will be important in our analysis as we will create a new variable based upon this.
Using logistic regression, we will investigate factors associated with death from malignant melanoma with particular interest in tumour ulceration.
### Get the data
The Help page (F1 on `boot::melanoma`) gives us its data dictionary including the definition of each variable and the coding used.
```{r, message = F}
melanoma <- boot::melanoma
```
### Check the data
As before, always carefully check and clean a new dataset before you start the analysis.
```{r eval=FALSE}
library(tidyverse)
library(finalfit)
melanoma %>% glimpse()
melanoma %>% ff_glimpse()
```
### Recode the data
We have seen some of this already (Section \@ref(chap08-recode): Recode data), but for this particular analysis we will recode some further variables.
```{r message=FALSE}
library(tidyverse)
library(finalfit)
melanoma <- melanoma %>%
mutate(sex.factor = factor(sex) %>%
fct_recode("Female" = "0",
"Male" = "1") %>%
ff_label("Sex"),
ulcer.factor = factor(ulcer) %>%
fct_recode("Present" = "1",
"Absent" = "0") %>%
ff_label("Ulcerated tumour"),
age = ff_label(age, "Age (years)"),
year = ff_label(year, "Year"),
status.factor = factor(status) %>%
fct_recode("Died melanoma" = "1",
"Alive" = "2",
"Died - other" = "3") %>%
fct_relevel("Alive") %>%
ff_label("Status"),
t_stage.factor =
thickness %>%
cut(breaks = c(0, 1.0, 2.0, 4.0,
max(thickness, na.rm=TRUE)),
include.lowest = TRUE)
)
```
Check the `cut()` function has worked:
```{r}
melanoma$t_stage.factor %>% levels()
```
Recode for ease.
```{r}
melanoma <- melanoma %>%
mutate(
t_stage.factor =
fct_recode(t_stage.factor,
"T1" = "[0,1]",
"T2" = "(1,2]",
"T3" = "(2,4]",
"T4" = "(4,17.4]") %>%
ff_label("T-stage")
)
```
We will now consider our outcome variable.
With a binary outcome and health data, we often have to make a decision as to *when* to determine if that variable has occurred or not.
In the next chapter we will look at survival analysis where this requirement is not needed.
Our outcome of interest is death from melanoma, but we need to decide when to define this.
A quick histogram of `time` stratified by `status.factor` helps.
We can see that most people who died from melanoma did so before 5 years (Figure \@ref(fig:chap09-fig-morthist)).
We can also see that the status of most of those who did not die is known beyond 5 years.
```{r chap09-fig-morthist, fig.height=3.5, fig.width=6, message=TRUE, fig.cap = "Time to outcome/follow-up times for patients in the melanoma dataset."}
library(ggplot2)
melanoma %>%
ggplot(aes(x = time/365)) +
geom_histogram() +
facet_grid(. ~ status.factor)
```
Let's decide then to look at 5-year mortality from melanoma.
We will define this as whether, at 5 years after surgery, a patient had died from melanoma or not.
```{r}
# 5-year mortality
melanoma <- melanoma %>%
mutate(
mort_5yr =
if_else((time/365) < 5 &
(status == 1),
"Yes", # then
"No") %>% # else
fct_relevel("No") %>%
ff_label("5-year survival")
)
```
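It is worth cross-checking the new variable against the original coding, for example by counting it against `status.factor`:
```{r eval=FALSE}
melanoma %>%
  count(status.factor, mort_5yr)
```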
### Plot the data
We are interested in the association between tumour ulceration and outcome (Figure \@ref(fig:chap09-fig-ulceration)).
```{r chap09-fig-ulceration, fig.height=3, fig.width=5, fig.cap = "Exploring ulceration and outcome (5-year mortality)."}
p1 <- melanoma %>%
ggplot(aes(x = ulcer.factor, fill = mort_5yr)) +
geom_bar() +
theme(legend.position = "none")
p2 <- melanoma %>%
ggplot(aes(x = ulcer.factor, fill = mort_5yr)) +
geom_bar(position = "fill") +
ylab("proportion")
library(patchwork)
p1 + p2
```
As we might have anticipated from our work in the previous chapter, 5-year mortality is higher in patients with ulcerated tumours compared with those with non-ulcerated tumours.
We are also interested in other variables that may be associated with tumour ulceration.
If they are also associated with our outcome, then they will confound the estimate of the direct effect of tumour ulceration.
We can plot out these relationships, or tabulate them instead.
### Tabulate data
We will use the convenient `summary_factorlist()` function from the `finalfit` package to look for differences across other variables by tumour ulceration.
```{r, eval=FALSE}
library(finalfit)
dependent <- "ulcer.factor"
explanatory <- c("age", "sex.factor", "year", "t_stage.factor")
melanoma %>%
summary_factorlist(dependent, explanatory, p = TRUE,
add_dependent_label = TRUE)
```
```{r, echo=FALSE}
library(finalfit)
dependent <- "ulcer.factor"
explanatory <- c("age", "sex.factor", "year", "t_stage.factor")
melanoma %>%
summary_factorlist(dependent, explanatory, p = TRUE,
column = TRUE, # proportions by column
add_dependent_label = TRUE) %>%
mykable(caption="Multiple variables by explanatory variable of interest: Malignant melanoma ulceration by patient and disease variables.") %>%
column_spec(1, width = "3.5cm")
```
It appears that patients with ulcerated tumours were older, more likely to be male, and had thicker/higher stage tumours.
It is important therefore to consider inclusion of these variables in a regression model.
## Model assumptions
\index{logistic regression@\textbf{logistic regression}!assumptions}
Binary logistic regression is robust to many of the assumptions which cause problems in other statistical analyses.
The main assumptions are:
1. Binary dependent variable - this is obvious, but as above we need to check (a three-level outcome of alive, died from disease, and died from other causes does not work without recoding);
2. Independence of observations - the observations should not be repeated measurements or matched data;
3. Linearity of continuous explanatory variables and the log-odds outcome - take age as an example. If the outcome, say death, gets more frequent or less frequent as age rises, the model will work well. However, say children and the elderly are at high risk of death, but those in middle years are not, then the relationship is not linear. Or more correctly, it is not monotonic, meaning that the response does not only go in one direction;
4. No multicollinearity - explanatory variables should not be highly correlated with each other.
### Linearity of continuous variables to the response
\index{logistic regression@\textbf{logistic regression}!loess}
A graphical check of linearity can be performed using a best fit "loess" line.
This is on the probability scale, so it is not going to be straight.
But it should be monotonic - it should only ever go up or down.
```{r chap09-fig-loess, fig.height=3, fig.width=5.5, message=FALSE, warning=FALSE, fig.cap = "Linearity of our continuous explanatory variables to the outcome (5-year mortality)."}
melanoma %>%
mutate(
mort_5yr.num = as.numeric(mort_5yr) - 1
) %>%
select(mort_5yr.num, age, year) %>%
pivot_longer(all_of(c("age", "year")), names_to = "predictors") %>%
ggplot(aes(x = value, y = mort_5yr.num)) +
geom_point(size = 0.5, alpha = 0.5) +
geom_smooth(method = "loess") +
facet_wrap(~predictors, scales = "free_x")
```
Figure \@ref(fig:chap09-fig-loess) shows that age is interesting as the relationship is u-shaped.
The chance of death is higher in the young and the old compared with the middle-aged.
This will need to be accounted for in any model including age as a predictor.
### Multicollinearity{#chap09-h2-multicollinearity}
\index{logistic regression@\textbf{logistic regression}!multicollinearity}
\index{collinearity}
\index{multicollinearity}
The presence of two or more highly correlated variables in a regression analysis can cause problems in the results which are generated.
The slopes of lines (coefficients, ORs) can become unstable, which means big shifts in their size with minimal changes to the model or the underlying data.
The confidence intervals around these coefficients may also be large.
Definitions of the specifics differ between sources, but there are broadly two situations.
The first is when two highly correlated variables have been included in a model, sometimes referred to simply as collinearity.
This can be detected by thinking about which variables may be correlated, and then checking using plotting.
The second situation is more devious.
It is where collinearity exists between three or more variables, even when no pair of variables is particularly highly correlated.
To detect this, we can use a specific metric called the *variance inflation factor*.
As always though, think clearly about your data and whether there may be duplication of information.
Have you included a variable which is calculated from other variables already included in the model?
Including body mass index (BMI), weight and height would be problematic, given the first is calculated from the latter two.
Are you describing a similar piece of information in two different ways?
For instance, all perforated colon cancers are staged T4, so do you include T-stage and the perforation factor?
(Note, not all T4 cancers have perforated.)
The `ggpairs()` function from `library(GGally)` is a good way of visualising all two-way associations (Figure \@ref(fig:chap09-fig-ggpairs)).
```{r chap09-fig-ggpairs, fig.height=6, fig.width=6, message=FALSE, fig.cap = "Exploring two-way associations within our explanatory variables."}
library(GGally)
explanatory <- c("ulcer.factor", "age", "sex.factor",
"year", "t_stage.factor")
melanoma %>%
remove_labels() %>% # ggpairs doesn't work well with labels
ggpairs(columns = explanatory)
```
\index{functions@\textbf{functions}!ggpairs}
If you have many variables you want to check you can split them up.
**Continuous to continuous**
Here we're using the same `library(GGally)` code as above, but shortlisting the two continuous variables: age and year (Figure \@ref(fig:chap09-fig-contcont)):
```{r chap09-fig-contcont, fig.width=4, fig.height=4, fig.cap = "Exploring relationships between continuous variables."}
select_explanatory <- c("age", "year")
melanoma %>%
remove_labels() %>%
ggpairs(columns = select_explanatory)
```
**Continuous to categorical**
Let's use a clever `pivot_longer()` and `facet_wrap()` combination to efficiently plot multiple variables against each other without using `ggpairs()`.
We want to compare everything against, for example, age so we need to include `-age` in the `pivot_longer()` call so it doesn't get lumped up with everything else (Figure \@ref(fig:chap09-fig-contcat)):
```{r chap09-fig-contcat, fig.height=3, fig.width=5, message=FALSE, warning=FALSE, fig.cap = "Exploring associations between continuous and categorical explanatory variables."}
select_explanatory <- c("age", "ulcer.factor",
"sex.factor", "t_stage.factor")
melanoma %>%
select(all_of(select_explanatory)) %>%
pivot_longer(-age) %>% # pivots all but age into two columns: name and value
ggplot(aes(value, age)) +
geom_boxplot() +
facet_wrap(~name, scale = "free", ncol = 3) +
coord_flip()
```
**Categorical to categorical**
```{r fig.height=3, fig.width=5, warning=FALSE}
select_explanatory <- c("ulcer.factor", "sex.factor", "t_stage.factor")
melanoma %>%
select(one_of(select_explanatory)) %>%
pivot_longer(-sex.factor) %>%
ggplot(aes(value, fill = sex.factor)) +
geom_bar(position = "fill") +
ylab("proportion") +
facet_wrap(~name, scale = "free", ncol = 2) +
coord_flip()
```
None of the explanatory variables are highly correlated with one another.
**Variance inflation factor**
\index{functions@\textbf{functions}!vif}
Finally, as a check for the presence of higher-order correlations, the variance inflation factor can be calculated for each of the terms in a final model.
In simple language, this is a measure of how much the variance of a particular regression coefficient is increased due to the presence of multicollinearity in the model.
Here is an example.
*GVIF* stands for generalised variance inflation factor.
A common rule of thumb is that if this is greater than 5-10 for any variable, then multicollinearity may exist.
The model should be further explored and the terms removed or reduced.
```{r}
dependent <- "mort_5yr"
explanatory <- c("ulcer.factor", "age", "sex.factor",
"year", "t_stage.factor")
melanoma %>%
glmmulti(dependent, explanatory) %>%
car::vif()
```
We are not trying to over-egg this, but multicollinearity can be important.
The message as always is the same.
Understand the underlying data using plotting and tables, and you are unlikely to come unstuck.
## Fitting logistic regression models in base R
\index{logistic regression@\textbf{logistic regression}!model fitting}
`glm()` stands for generalised linear model and is the standard base R approach to logistic regression.
The `glm()` function has several options and many different types of model can be run.
For instance, 'Poisson regression' for count data.
To run binary logistic regression use `family = binomial`.
This defaults to `family = binomial(link = 'logit')`.
Other link functions exist, such as the `probit` function, but this makes little difference to final conclusions.
Let's start with a simple univariable model using the classical R approach.
```{r}
fit1 <- glm(mort_5yr ~ ulcer.factor, data = melanoma, family = binomial)
summary(fit1)
```
\index{functions@\textbf{functions}!glm}
This is the standard R output which you should become familiar with.
It is included in the previous figures.
The estimates of the coefficients (slopes) in this output are on the log-odds scale and always will be.
Easier approaches for doing this in practice are shown below, but for completeness here we will show how to extract the results.
`str()` shows all the information included in the model object, which is useful for experts but a bit off-putting if you are starting out.
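For a gentler overview, limiting `str()` to the top level lists the components of the model object without printing their contents:
```{r eval=FALSE}
str(fit1, max.level = 1)  # top-level components of the glm object only
```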
The coefficients and their 95% confidence intervals can be extracted and exponentiated like this.
```{r}
coef(fit1) %>% exp()
confint(fit1) %>% exp()
```
\index{functions@\textbf{functions}!coef}
\index{functions@\textbf{functions}!confint}
Note that the 95% confidence interval lies between the 2.5% and 97.5% quantiles of the distribution, which is why the output is labelled in this way.
A good alternative is the `tidy()` function from the **broom** package.
```{r}
library(broom)
fit1 %>%
  tidy(conf.int = TRUE, exponentiate = TRUE)
```
We can see from these results that there is a strong association between tumour ulceration and 5-year mortality (OR 6.68, 95%CI 3.18, 15.18).
Model metrics can be extracted using the `glance()` function.
```{r}
fit1 %>%
glance()
```
## Modelling strategy for binary outcomes
\index{logistic regression@\textbf{logistic regression}!model fitting principles}
A statistical model is a tool to understand the world.
The better your model describes your data, the more useful it will be.
Fitting a successful statistical model requires decisions around which variables to include in the model.
Our advice regarding variable selection follows the same lines as in the linear regression chapter.
1. As few explanatory variables should be used as possible (parsimony);
2. Explanatory variables associated with the outcome variable in previous studies should be accounted for;
3. Demographic variables should be included in model exploration;
4. Population stratification should be incorporated if available;
5. Interactions should be checked and included if influential;
6. Final model selection should be performed using a "criterion-based approach"
+ minimise the Akaike information criterion (AIC)
+ maximise the c-statistic (area under the receiver operator curve).
We will use these principles through the next section.
## Fitting logistic regression models with finalfit
Our preference in model fitting is now to use our own **finalfit** package.
It gets us to our results quicker and more easily, and produces our final model tables which go directly into manuscripts for publication (we hope).
The approach is the same as in linear regression.
If the outcome variable is correctly specified as a factor, the `finalfit()` function will run a logistic regression model directly.
```{r eval=FALSE}
library(finalfit)
dependent <- "mort_5yr"
explanatory <- "ulcer.factor"
melanoma %>%
finalfit(dependent, explanatory, metrics = TRUE)
```
```{r echo=FALSE, message=FALSE}
library(finalfit)
dependent <- "mort_5yr"
explanatory <- "ulcer.factor"
fit <- melanoma %>%
finalfit(dependent, explanatory, metrics = TRUE)
fit[[1]] %>% mykable(caption = "Univariable logistic regression: 5-year survival from malignant melanoma by tumour ulceration (fit 1).") %>%
column_spec(1, width = "3.5cm")
fit[[2]] %>% mykable(caption = "Model metrics: 5-year survival from malignant melanoma by tumour ulceration (fit 1).", col.names = "") %>%
column_spec(1, width = "18cm")
```
### Criterion-based model fitting
Passing `metrics = TRUE` to `finalfit()` gives us a useful list of model fitting parameters.
We recommend looking at three metrics:
* Akaike information criterion (AIC), which should be minimised,
* C-statistic (area under the receiver operator curve), which should be maximised;
* Hosmer–Lemeshow test, which should be non-significant.
\newpage
**AIC**
\index{logistic regression@\textbf{logistic regression}!AIC}
\index{AIC}
The AIC has been previously described (Section \@ref(chap07-aic)).
It provides a measure of model goodness-of-fit - or how well the model fits the available data.
It is penalised for each additional variable, so should be somewhat robust against over-fitting (when the model starts to describe noise).
**C-statistic**
\index{logistic regression@\textbf{logistic regression}!C-statistic}
\index{C-statistic}
The c-statistic or area under the receiver operator curve (ROC) provides a measure of model 'discrimination'.
It runs from 0.5 to 1.0, with 0.5 being no better than chance and 1.0 being perfect discrimination.
What the number actually represents can be thought of like this.
Take our example of death from melanoma.
If you take a random patient who died and a random patient who did not die, then the c-statistic is the probability that the model predicts that patient 1 is more likely to die than patient 2.
In our example above, the model should get that correct 72% of the time.
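Should you wish to compute this directly for a given model, one option (assuming the **pROC** package is installed) is to build the ROC curve from the model's predicted probabilities; a minimal sketch using `fit1` from above:
```{r eval=FALSE}
library(pROC)
pred <- predict(fit1, type = "response")   # predicted probabilities
roc_fit1 <- roc(melanoma$mort_5yr, pred)   # build the ROC curve
auc(roc_fit1)                              # c-statistic / area under the ROC curve
```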
**Hosmer-Lemeshow test**
\index{logistic regression@\textbf{logistic regression}!Hosmer-Lemeshow test}
\index{Hosmer-Lemeshow test}
If you are interested in using your model for prediction, it is important that it is calibrated correctly.
Using our example, calibration means that the model accurately predicts death from melanoma when the risk to the patient is low and also accurately predicts death when the risk is high.
The model should work well across the range of probabilities of death.
The Hosmer-Lemeshow test assesses this.
By default, it assesses the predictive accuracy for death in deciles of risk.
If the model predicts equally well (or badly) at low probabilities compared with high probabilities, there will be no evidence of a difference between observed and predicted outcomes and the p-value will be non-significant.
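The test itself can also be run directly on a fitted model; one commonly used implementation is `hoslem.test()` from the **ResourceSelection** package (assuming it is installed), which compares observed and predicted outcomes across `g = 10` groups of risk:
```{r eval=FALSE}
library(ResourceSelection)
hoslem.test(as.numeric(melanoma$mort_5yr) - 1,   # observed outcome as 0/1
            predict(fit1, type = "response"),    # predicted probabilities
            g = 10)                              # deciles of risk
```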
## Model fitting{#chap09-model-fitting}
\index{logistic regression@\textbf{logistic regression}!model fitting}
\index{logistic regression@\textbf{logistic regression}!finalfit}
Engage with the data and the results when model fitting.
Do not use automated processes - you have to keep thinking.
Three things are important to keep looking at:
* what is the association between a particular variable and the outcome (OR and 95%CI);
* how much information is a variable bringing to the model (change in AIC and c-statistic);
* how much influence does adding a variable have on the effect size of another variable, and in particular the variable of interest (a rule of thumb: if the OR of the variable of interest changes by more than 10% when a new variable is added to the model, the new variable is likely to be important).
We're going to start by including the variables from above which we think are relevant.
```{r eval=FALSE}
library(finalfit)
dependent <- "mort_5yr"
explanatory <- c("ulcer.factor", "age", "sex.factor", "t_stage.factor")
melanoma %>%
finalfit(dependent, explanatory, metrics = TRUE)
```
```{r echo=FALSE, message=FALSE}
library(finalfit)
dependent <- "mort_5yr"
explanatory <- c("ulcer.factor", "age", "sex.factor", "t_stage.factor")
fit <- melanoma %>%
finalfit(dependent, explanatory, metrics = TRUE)
fit[[1]] %>% mykable(caption = "Multivariable logistic regression: 5-year survival from malignant melanoma (fit 2).") %>%
column_spec(1, width = "3.5cm")
fit[[2]] %>% mykable(caption = "Model metrics: 5-year survival from malignant melanoma (fit 2).", col.names = "") %>%
column_spec(1, width = "18cm")
```
The model metrics have improved with the AIC decreasing from 192 to 188 and the c-statistic increasing from 0.717 to 0.798.
Let's consider `age`.
We may expect age to be associated with the outcome because it so commonly is.
But there is weak evidence of an association in the univariable analysis.
We have shown above that the relationship of age to the outcome is not linear, therefore we need to act on this.
We can either convert age to a categorical variable or include it with a quadratic term ($x^2 + x$, remember parabolas from school?).
```{r eval=FALSE}
melanoma <- melanoma %>%
mutate(
age.factor = cut(age,
breaks = c(0, 25, 50, 75, 100)) %>%
ff_label("Age (years)"))
# Add this to relevel:
# fct_relevel("(50,75]")
melanoma %>%
finalfit(dependent, c("ulcer.factor", "age.factor"), metrics = TRUE)
```
```{r echo=FALSE, message=FALSE}
melanoma <- melanoma %>%
mutate(
age.factor = cut(age,
breaks = c(0, 25, 50, 75, 100)) %>%
ff_label("Age (years)"))
# Add this to relevel:
# fct_relevel("(50,75]")
fit <- melanoma %>%
finalfit(dependent, c("ulcer.factor", "age.factor"), metrics = TRUE)
fit[[1]] %>% mykable(caption = "Multivariable logistic regression: using cut to convert a continuous variable as a factor (fit 3).") %>%
column_spec(1, width = "3.5cm")
fit[[2]] %>% mykable(caption = "Model metrics: using `cut` to convert a continuous variable as a factor (fit 3).", col.names = "") %>%
column_spec(1, width = "18cm")
```
There is no strong relationship between the categorical representation of age and the outcome.
Let's try a quadratic term.
In base R, a quadratic term is added like this.
```{r}
glm(mort_5yr ~ ulcer.factor + I(age^2) + age,
data = melanoma, family = binomial) %>%
summary()
```
\index{functions@\textbf{functions}!glm}
It can be done with **finalfit** in a similar manner.
Note that with the default univariable model settings, the quadratic and linear terms are considered in separate models, which doesn't make much sense.
```{r eval=FALSE}
library(finalfit)
dependent <- "mort_5yr"
explanatory <- c("ulcer.factor", "I(age^2)", "age")
melanoma %>%
finalfit(dependent, explanatory, metrics = TRUE)
```
```{r echo=FALSE, message=FALSE}
library(finalfit)
dependent <- "mort_5yr"
explanatory <- c("ulcer.factor", "I(age^2)", "age")
fit <- melanoma %>%
finalfit(dependent, explanatory, metrics = TRUE)
fit[[1]] %>% mykable(caption = "Multivariable logistic regression: including a quadratic term (fit 4).") %>%
column_spec(1, width = "3.5cm")
fit[[2]] %>% mykable(caption = "Model metrics: including a quadratic term (fit 4).", col.names = "") %>%
column_spec(1, width = "18cm")
```
The AIC is worse when adding age either as a factor or with a quadratic term to the base model.
One final method to visualise the contribution of a particular variable is to remove it from the full model.
This is convenient in **finalfit**.
```{r eval=FALSE}
library(finalfit)
dependent <- "mort_5yr"
explanatory <- c("ulcer.factor", "age.factor", "sex.factor", "t_stage.factor")
explanatory_multi <- c("ulcer.factor", "sex.factor", "t_stage.factor")
melanoma %>%
finalfit(dependent, explanatory, explanatory_multi,
keep_models = TRUE, metrics = TRUE)
```
```{r echo=FALSE, message=FALSE}
library(finalfit)
dependent <- "mort_5yr"
explanatory <- c("ulcer.factor", "age.factor", "sex.factor", "t_stage.factor")
explanatory_multi <- c("ulcer.factor", "sex.factor", "t_stage.factor")
fit <- melanoma %>%
finalfit(dependent, explanatory, explanatory_multi,
keep_models = TRUE, metrics = TRUE)
fit[[1]] %>% mykable(caption = "Multivariable logistic regression model: comparing a reduced model in one table (fit 5).") %>%
column_spec(1, width = "3.5cm")
fit[[2]] %>% unlist() %>%
mykable(caption = "Model metrics: comparing a reduced model in one table (fit 5).", col.names = "") %>%
column_spec(1, width = "18cm")
```
The AIC improves when age is removed (186 from 190) at only a small loss in discrimination (0.794 from 0.802).
Looking at the model table and comparing the full multivariable with the reduced multivariable, there has been a small change in the OR for ulceration, with some of the variation accounted for by age now being taken up by ulceration.
This is to be expected, given the association (albeit weak) that we saw earlier between age and ulceration.
Given all this, we will decide not to include age in the model.
Now, what about the variable sex?
It has a significant association with the outcome in the univariable analysis, but much of this is explained by other variables in multivariable analysis.
Is it contributing much to the model?
```{r eval=FALSE}
library(finalfit)
dependent <- "mort_5yr"
explanatory <- c("ulcer.factor", "sex.factor", "t_stage.factor")
explanatory_multi <- c("ulcer.factor", "t_stage.factor")
melanoma %>%
finalfit(dependent, explanatory, explanatory_multi,
keep_models = TRUE, metrics = TRUE)
```
```{r echo=FALSE, message=FALSE}
library(finalfit)
dependent <- "mort_5yr"
explanatory <- c("ulcer.factor", "sex.factor", "t_stage.factor")
explanatory_multi <- c("ulcer.factor", "t_stage.factor")
fit <- melanoma %>%
finalfit(dependent, explanatory, explanatory_multi,
keep_models = TRUE, metrics = TRUE)
fit[[1]] %>% mykable(caption = "Multivariable logistic regression: further reducing the model (fit 6).") %>%
column_spec(1, width = "3.5cm")
fit[[2]] %>% unlist() %>% mykable(caption = "Model metrics: further reducing the model (fit 6).", col.names = "") %>%
column_spec(1, width = "18cm")
```
By removing sex we have improved the AIC a little (184.4 from 186.1) with a small change in the c-statistic (0.791 from 0.794).
Looking at the model table, the variation has been taken up mostly by stage 4 disease and a little by ulceration.
But there has been little change overall.
We will exclude sex from our final model as well.
As a final step, we can check for a first-order interaction between ulceration and T-stage.
Just to remind us what this means, a significant interaction would mean the effect of, say, ulceration on 5-year mortality would differ by T-stage.
For instance, perhaps the presence of ulceration confers a much greater risk of death in advanced deep tumours compared with earlier superficial tumours.
```{r eval=FALSE}
library(finalfit)
dependent <- "mort_5yr"
explanatory <- c("ulcer.factor", "t_stage.factor")
explanatory_multi <- c("ulcer.factor*t_stage.factor")
melanoma %>%
finalfit(dependent, explanatory, explanatory_multi,
keep_models = TRUE, metrics = TRUE)
```
```{r echo=FALSE, message=FALSE}
library(finalfit)
dependent <- "mort_5yr"
explanatory <- c("ulcer.factor", "t_stage.factor")
explanatory_multi <- c("ulcer.factor*t_stage.factor")
melanoma %>%
finalfit(dependent, explanatory, explanatory_multi,
keep_models = TRUE, add_dependent_label = FALSE) %>%
mutate(
label = c("Ulcerated tumour",
"",
"T-stage",
"",
"",
"",
"UlcerPresent:T2",
"UlcerPresent:T3",
"UlcerPresent:T4"),
`OR (univariable)` = factor(`OR (univariable)`) %>%
fct_explicit_na(na_level = "-"),
`OR (multivariable)` = factor(`OR (multivariable)`) %>%
fct_explicit_na(na_level = "-")
) %>%
#na.omit() %>%
mykable(caption = "Multivariable logistic regression: including an interaction term (fit 7).") %>%
column_spec(1, width = "3.5cm")
```
There are no significant interaction terms.
Our final model table is therefore:
```{r eval=FALSE}
library(finalfit)
dependent <- "mort_5yr"
explanatory <- c("ulcer.factor", "age.factor",
"sex.factor", "t_stage.factor")
explanatory_multi <- c("ulcer.factor", "t_stage.factor")
melanoma %>%
finalfit(dependent, explanatory, explanatory_multi, metrics = TRUE)
```
```{r echo=FALSE, message=FALSE}
library(finalfit)
dependent <- "mort_5yr"
explanatory <- c("ulcer.factor", "age.factor", "sex.factor", "t_stage.factor")
explanatory_multi <- c("ulcer.factor", "t_stage.factor")
fit <- melanoma %>%
finalfit(dependent, explanatory, explanatory_multi, metrics = TRUE)
fit[[1]] %>% mykable(caption = "Multivariable logistic regression: final model (fit 8).") %>%
column_spec(1, width = "3.5cm")
fit[[2]] %>% mykable(caption = "Model metrics: final model (fit 8).", col.names = "") %>%
column_spec(1, width = "18cm")
```
### Odds ratio plot
\index{logistic regression@\textbf{logistic regression}!odds ratio plot}
\index{plotting@\textbf{plotting}!odds ratio}
```{r fig.height=3, fig.width=7, message=FALSE, warning=FALSE, fig.cap="Odds ratio plot."}
dependent <- "mort_5yr"
explanatory_multi <- c("ulcer.factor", "t_stage.factor")
melanoma %>%
or_plot(dependent, explanatory_multi,
breaks = c(0.5, 1, 2, 5, 10, 25),
table_text_size = 3.5,
title_text_size = 16)
```
\index{functions@\textbf{functions}!or\_plot}
\index{plotting@\textbf{plotting}!or\_plot}
We can conclude that there is evidence of an association between tumour ulceration and 5-year survival which is independent of the tumour depth as captured by T-stage.
## Correlated groups of observations
\index{logistic regression@\textbf{logistic regression}!correlated groups}
\index{logistic regression@\textbf{logistic regression}!mixed effects}
\index{logistic regression@\textbf{logistic regression}!random effects}
\index{logistic regression@\textbf{logistic regression}!multilevel}
\index{logistic regression@\textbf{logistic regression}!hierarchical}
In our modelling strategy above, we mentioned the incorporation of population stratification if available.
What does this mean?
Our regression is seeking to capture the characteristics of particular patients.
These characteristics are made manifest through the slopes of fitted lines - the estimated coefficients (ORs) of particular variables.
A goal is to estimate these characteristics as precisely as possible.
Bias can be introduced when correlations between patients are not accounted for.
Correlations may be as simple as being treated within the same hospital.
By virtue of this fact, these patients may have commonalities that have not been captured by the observed variables.
Population characteristics can be incorporated into our models.
We may not be interested in capturing and measuring the effects themselves, but want to ensure they are accounted for in the analysis.
One approach is to include grouping variables as `random effects`.
These may be nested with each other, for example patients within hospitals within countries.
These are added in addition to the `fixed effects` we have been dealing with up until now.
These models go under different names including mixed effects model, multilevel model, or hierarchical model.
Other approaches, such as generalised estimating equations, are not dealt with here.
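To give a flavour of what such a model looks like in code, a random-intercept logistic regression can be fitted with `glmer()` from the **lme4** package. The sketch below assumes a hypothetical `hospital` grouping variable, which the melanoma data do not contain (hence the simulation in the next section):
```{r eval=FALSE}
library(lme4)
# `hospital` is a hypothetical grouping variable used only for illustration.
# (1 | hospital) adds a random intercept for each hospital to the fixed effects.
glmer(mort_5yr ~ ulcer.factor + t_stage.factor + (1 | hospital),
      data = melanoma, family = binomial)
```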
### Simulate data
Our melanoma dataset doesn't include any higher level structure, so we will simulate this for demonstration purposes.