# Logistic regression{#chap09-h1}
\index{logistic regression@\textbf{logistic regression}}
> All generalizations are false, including this one.
> Mark Twain
```{r echo=FALSE, message=FALSE}
library(knitr)
library(kableExtra)
library(ggplot2)
mykable = function(x, caption = "CAPTION", ...){
kable(x, row.names = FALSE, align = c("l", "l", "r", "r", "r", "r", "r", "r", "r"),
booktabs = TRUE, caption = caption,
linesep = c("", "", "\\addlinespace"), ...) %>%
kable_styling(latex_options = c("scale_down", "hold_position"))
}
theme_set(theme_bw())
```
## Generalised linear modelling
Do not start here!
The material covered in this chapter is best understood after having read linear regression (Chapter \@ref(chap07-h1)) and working with categorical outcome variables (Chapter \@ref(chap08-h1)).
Generalised linear modelling is an extension to the linear modelling we are now familiar with.
It allows the principles of linear regression to be applied when outcomes are not continuous numeric variables.
## Binary logistic regression
\index{logistic regression@\textbf{logistic regression}!binary data}
\index{binary data}
A regression analysis is a statistical approach to estimating the relationships between variables, often by drawing straight lines through data points.
For instance, we may try to predict blood pressure in a group of patients based on their coffee consumption (Figure \@ref(fig:chap07-fig-regression) from Chapter \@ref(chap07-h1)).
As blood pressure and coffee consumption can be considered on a continuous scale, this is an example of simple linear regression.
Logistic regression is an extension of this, where the variable being predicted is *categorical*.
We will focus on binary logistic regression, where the dependent variable has two levels, e.g., yes or no, 0 or 1, dead or alive.
Other types of logistic regression include 'ordinal', when the outcome variable has >2 ordered levels, and 'multinomial', where the outcome variable has >2 levels with no inherent order.
We will only deal with binary logistic regression.
When we use the term 'logistic regression', that is what we are referring to.
We have good reason.
In healthcare we are often interested in an event (like death) occurring or not occurring.
Binary logistic regression can tell us the probability of this outcome occurring in a patient with a particular set of characteristics.
Although in binary logistic regression the outcome must have two levels, remember that the predictors (explanatory variables) can be either continuous or categorical.
<!-- Consider making all "Questions" specific in title, e.g. Coffee and CV event etc -->
### The Question (1)
As in previous chapters, we will use concrete examples when discussing the principles of the approach.
We return to our example of coffee drinking.
Yes, we are a little obsessed with coffee.
Our outcome variable was previously blood pressure.
We will now consider our outcome as the occurrence of a cardiovascular (CV) event over a 10-year period.
A cardiovascular event includes the diagnosis of ischemic heart disease, a heart attack (myocardial infarction), or a stroke (cerebrovascular accident).
The diagnosis of a cardiovascular event is clearly a binary condition: it either happens or it does not.
This is ideal for modelling using binary logistic regression.
But remember, the data are completely simulated and not based on anything in the real world.
This bit is just for fun!
### Odds and probabilities
\index{logistic regression@\textbf{logistic regression}!odds and probabilities}
To understand logistic regression we need to remind ourselves about odds and probability.
Odds and probabilities can get confusing so get them straight with Figure \@ref(fig:chap09-fig-odds).
```{r chap09-fig-odds, echo = FALSE, fig.cap="Probability vs odds."}
knitr::include_graphics("images/chapter09/0_odds.png", auto_pdf = TRUE)
```
In many situations, there is no particular reason to prefer one to the other.
However, humans seem to have a preference for expressing chance in terms of probabilities, while odds have particular mathematical properties that make them useful in regression.
When a probability is 0, the odds are 0.
When a probability is between 0 and 0.5, the odds are less than 1.0 (i.e., less than "1 to 1").
As probability increases from 0.5 to 1.0, the odds increase from 1.0 to approach infinity.
Thus the range of probability is 0 to 1 and the range of odds is 0 to $+\infty$.
Odds and probabilities can easily be interconverted.
For example, if the odds of a patient dying from a disease are 1/3 (in horse racing this is stated as '3 to 1 against'), then the probability of death (also known as risk) is 0.25 (or 25%).
Odds of `1 to 1` equal a probability of 50%.
$Odds = \frac{p}{1-p}$, where $p$ is the probability of the outcome occurring.
$Probability = \frac{odds}{odds+1}$.
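As a quick check, the example above can be reproduced with a couple of lines of R arithmetic:
```{r}
p <- 0.25            # probability of death from the example above
odds <- p / (1 - p)  # odds of 1/3 ('3 to 1 against')
odds
odds / (odds + 1)    # and back to a probability of 0.25
```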
### Odds ratios
\index{logistic regression@\textbf{logistic regression}!odds ratio}
\index{odds ratio}
Another important term to remind ourselves of is the 'odds ratio'.
Why?
Because in a logistic regression the slopes of fitted lines (coefficients) can be interpreted as odds ratios.
This is very useful when interpreting the association of a particular predictor with an outcome.
For a given categorical predictor such as smoking, the difference in chance of the outcome occurring for smokers vs non-smokers can be expressed as a ratio of odds or odds ratio (Figure \@ref(fig:chap09-fig-or)).
For example, if the odds of a smoker having a CV event are 1.5 and the odds of a non-smoker are 1.0, then the odds of a smoker having an event are 1.5-times greater than a non-smoker, odds ratio = 1.5.
```{r chap09-fig-or, echo = FALSE, fig.cap="Odds ratios."}
knitr::include_graphics("images/chapter09/1_or.png", auto_pdf = TRUE)
```
An alternative is a ratio of probabilities which is called a risk ratio or relative risk.
We will continue to work with odds ratios given they are an important expression of effect size in logistic regression analysis.
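To make the distinction concrete, here is a small worked calculation using the illustrative smoker and non-smoker odds from above (simple arithmetic, not output from the simulated dataset):
```{r}
odds_smoker     <- 1.5
odds_non_smoker <- 1.0
odds_smoker / odds_non_smoker   # odds ratio = 1.5

# Convert each odds to a probability, then take the ratio of probabilities
p_smoker     <- odds_smoker / (odds_smoker + 1)          # 0.6
p_non_smoker <- odds_non_smoker / (odds_non_smoker + 1)  # 0.5
p_smoker / p_non_smoker         # risk ratio (relative risk) = 1.2
```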
### Fitting a regression line
\index{logistic regression@\textbf{logistic regression}!fitted line}
Let's return to the task at hand.
The difficulty in moving from a continuous to a binary outcome variable quickly becomes obvious.
If our $y$-axis only has two values, say 0 and 1, then how can we fit a line through our data points?
An assumption of linear regression is that the dependent variable is continuous, unbounded, and measured on an interval or ratio scale.
Unfortunately, binary dependent variables fulfil none of these requirements.
The answer is what makes logistic regression so useful.
Rather than estimating $y=0$ or $y=1$ from the $x$-axis, we estimate the *probability* of $y=1$.
There is one more difficulty in this though.
Probabilities can only exist for values of 0 to 1.
The probability scale is therefore not linear - straight lines do not make sense on it.
As we saw above, the odds scale runs from 0 to $+\infty$.
But here, probabilities from 0 to 0.5 are squashed into odds of 0 to 1, and probabilities from 0.5 to 1 have the expansive comfort of 1 to $+\infty$.
This is why we fit binary data on a *log-odds scale*.
A log-odds scale sounds incredibly off-putting to non-mathematicians, but it is the perfect solution.
* Log-odds run from $-\infty$ to $+\infty$;
* odds of 1 become log-odds of 0;
* a doubling and a halving of odds represent the same distance on the scale.
```{r}
log(1)
log(2)
log(0.5)
```
I'm sure some are shouting 'obviously' at the page.
That is good!
This is wrapped up in a transformation (a bit like the transformations shown in Section \@ref(chap06-transform)) using the so-called logit function.
This can be skipped with no loss of understanding, but for those who just-gots-to-see, the logit function is,
$\log_e (\frac{p}{1-p})$, where $p$ is the probability of the outcome occurring.
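In base R, the logit transformation and its inverse are available as `qlogis()` and `plogis()`:
```{r}
qlogis(0.5)    # a probability of 0.5 is a log-odds of 0 (odds of 1)
qlogis(0.25)   # probabilities below 0.5 give negative log-odds
plogis(0)      # the inverse: log-odds of 0 back to a probability of 0.5
```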
Figure \@ref(fig:chap09-fig-logodds) demonstrates the fitted lines from a logistic regression model of cardiovascular event by coffee consumption, stratified by smoking on the log-odds scale (A) and the probability scale (B).
We could conclude, for instance, that on average, non-smokers who drink 2 cups of coffee per day have a 50% chance of a cardiovascular event.
```{r chap09-fig-logodds, echo = FALSE, fig.cap="A logistic regression model of life-time cardiovascular event occurrence by coffee consumption stratified by smoking (simulated data). Fitted lines plotted on the log-odds scale (A) and probability scale (B). *lines are straight when no polynomials or splines are included in regression. "}
knitr::include_graphics("images/chapter09/2_prob_logodds.png", auto_pdf = TRUE)
```
### The fitted line and the logistic regression equation
Figure \@ref(fig:chap09-fig-equation) links the logistic regression equation, the appearance of the fitted lines on the probability scale, and the output from a standard base `R` analysis.
The dots at the top and bottom of the plot represent whether individual patients have had an event or not. The fitted line, therefore, represents the point-to-point probability of a patient with a particular set of characteristics having the event or not.
Compare this to Figure \@ref(fig:chap07-fig-equation) to be clear on the difference.
The slope of the line is linear on the log-odds scale and these are presented in the output on the log-odds scale.
Thankfully, it is straightforward to convert these to odds ratios, a measure we can use to communicate effect size and direction effectively.
Said in more technical language, the exponential of the coefficient on the log-odds scale can be interpreted as an odds ratio.
For a continuous variable such as cups of coffee consumed, the odds ratio is the change in odds of a CV event associated with a 1 cup increase in coffee consumption.
We are dealing with linear responses here, so the odds ratio is the same for an increase from 1 to 2 cups, or 3 to 4 cups etc.
Remember that if the odds ratio for 1 unit of change is 1.5, then the odds ratio for 2 units of change is $exp(log(1.5)*2) = 2.25$.
For a categorical variable such as smoking, the odds ratio is the change in odds of a CV event associated with smoking compared with not smoking (the reference level).
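Returning to the continuous example, this scaling of an odds ratio is easily checked in R:
```{r}
exp(log(1.5) * 2)  # odds ratio for a 2-unit change
1.5^2              # equivalently, squaring the 1-unit odds ratio
```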
```{r chap09-fig-equation, echo = FALSE, fig.cap="Linking the logistic regression fitted line and equation (A) with the R output (B)."}
knitr::include_graphics("images/chapter09/4_equation.png", auto_pdf = TRUE)
```
### Effect modification and confounding
\index{logistic regression@\textbf{logistic regression}!effect modification}
\index{logistic regression@\textbf{logistic regression}!confounding}
\index{logistic regression@\textbf{logistic regression}!interactions}
As with all multivariable regression models, logistic regression allows the incorporation of multiple variables which all may have direct effects on outcome or may confound the effect of another variable.
This was explored in detail in Section \@ref(chap07-confound); all of the same principles apply.
Adjusting for effect modification and confounding allows us to isolate the direct effect of an explanatory variable of interest upon an outcome.
In our example, we are interested in direct effect of coffee drinking on the occurrence of cardiovascular disease, independent of any association between coffee drinking and smoking.
Figure \@ref(fig:chap09-fig-types) demonstrates simple, additive and multiplicative models.
Think back to Figure \@ref(fig:chap07-fig-types) and the discussion around it as these terms are easier to think about when looking at the linear regression example, but essentially they work the same way in logistic regression.
Presented on the probability scale, the effect of the interaction is difficult to see.
It is obvious on the log-odds scale that the fitted lines are no longer constrained to be parallel.
```{r chap09-fig-types, echo = FALSE, fig.cap="Multivariable logistic regression (A) with additive (B) and multiplicative (C) effect modification."}
knitr::include_graphics("images/chapter09/6_types.png", auto_pdf = TRUE)
```
The interpretation of the interaction term is important.
The exponential of the interaction coefficient term represents a 'ratio-of-odds ratios'.
This is easiest to see through a worked example.
In Figure \@ref(fig:chap09-fig-interaction) the effect of coffee on the odds of a cardiovascular event can be compared in smokers and non-smokers.
The effect is now different given the inclusion of a significant interaction term.
Please check back to the linear regression chapter if this is not making sense.
```{r chap09-fig-interaction, echo = FALSE, fig.cap="Multivariable logistic regression with interaction term. The exponential of the interaction term is a ratio-of-odds ratios (ROR)."}
knitr::include_graphics("images/chapter09/7_interactions.png", auto_pdf = TRUE)
```
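To show what this looks like in code, the sketch below fits a multiplicative interaction with `glm()`; the data frame and variable names (`coffee_data`, `cv_event`, `cups`, `smoking`) are hypothetical stand-ins for the simulated coffee data, which are not provided with this book:
```{r eval=FALSE}
# Sketch only (not run): `coffee_data` and its variables are hypothetical.
fit_int <- glm(cv_event ~ cups * smoking,
               data = coffee_data, family = binomial)

# The exponential of the interaction coefficient is the ratio-of-odds ratios:
# how much the per-cup odds ratio differs between smokers and non-smokers.
exp(coef(fit_int))
```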
## Data preparation and exploratory analysis
### The Question (2)
We will go on to explore the `boot::melanoma` dataset introduced in Chapter \@ref(chap08-h1).
The data consist of measurements made on patients after surgery to remove the melanoma skin cancer in the University Hospital of Odense, Denmark, between 1962 and 1977.
Malignant melanoma is an aggressive and highly invasive cancer, making it difficult to treat.
To determine how advanced it is, staging is based on the depth of the tumour.
The current TNM classification cut-offs are:
- T1: $\leq$ 1.0 mm depth
- T2: 1.1 to 2.0 mm depth
- T3: 2.1 to 4.0 mm depth
- T4: > 4.0 mm depth
This will be important in our analysis as we will create a new variable based upon this.
Using logistic regression, we will investigate factors associated with death from malignant melanoma with particular interest in tumour ulceration.
### Get the data
The Help page (F1 on `boot::melanoma`) gives us its data dictionary including the definition of each variable and the coding used.
```{r, message = F}
melanoma <- boot::melanoma
```
### Check the data
As before, always carefully check and clean a new dataset before you start the analysis.
```{r eval=FALSE}
library(tidyverse)
library(finalfit)
melanoma %>% glimpse()
melanoma %>% ff_glimpse()
```
### Recode the data
We have seen some of this already (Section \@ref(chap08-recode): Recode data), but for this particular analysis we will recode some further variables.
```{r message=FALSE}
library(tidyverse)
library(finalfit)
melanoma <- melanoma %>%
mutate(sex.factor = factor(sex) %>%
fct_recode("Female" = "0",
"Male" = "1") %>%
ff_label("Sex"),
ulcer.factor = factor(ulcer) %>%
fct_recode("Present" = "1",
"Absent" = "0") %>%
ff_label("Ulcerated tumour"),
age = ff_label(age, "Age (years)"),
year = ff_label(year, "Year"),
status.factor = factor(status) %>%
fct_recode("Died melanoma" = "1",
"Alive" = "2",
"Died - other" = "3") %>%
fct_relevel("Alive") %>%
ff_label("Status"),
t_stage.factor =
thickness %>%
cut(breaks = c(0, 1.0, 2.0, 4.0,
max(thickness, na.rm=TRUE)),
include.lowest = TRUE)
)
```
Check the `cut()` function has worked:
```{r}
melanoma$t_stage.factor %>% levels()
```
Recode for ease.
```{r}
melanoma <- melanoma %>%
mutate(
t_stage.factor =
fct_recode(t_stage.factor,
"T1" = "[0,1]",
"T2" = "(1,2]",
"T3" = "(2,4]",
"T4" = "(4,17.4]") %>%
ff_label("T-stage")
)
```
We will now consider our outcome variable.
With a binary outcome and health data, we often have to make a decision as to *when* to determine if that variable has occurred or not.
In the next chapter we will look at survival analysis where this requirement is not needed.
Our outcome of interest is death from melanoma, but we need to decide when to define this.
A quick histogram of `time` stratified by `status.factor` helps.
We can see that most people who died from melanoma did so before 5 years (Figure \@ref(fig:chap09-fig-morthist)).
We can also see that the status of most of those who did not die is known beyond 5 years.
```{r chap09-fig-morthist, fig.height=3.5, fig.width=6, message=TRUE, fig.cap = "Time to outcome/follow-up times for patients in the melanoma dataset."}
library(ggplot2)
melanoma %>%
ggplot(aes(x = time/365)) +
geom_histogram() +
facet_grid(. ~ status.factor)
```
Let's decide then to look at 5-year mortality from melanoma.
We will define this as whether, at 5 years after surgery, a patient had died from melanoma or not.
```{r}
# 5-year mortality
melanoma <- melanoma %>%
mutate(
mort_5yr =
if_else((time/365) < 5 &
(status == 1),
"Yes", # then
"No") %>% # else
fct_relevel("No") %>%
ff_label("5-year survival")
)
```
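It is worth cross-checking the new variable against the original coding, for example by counting it against `status.factor`:
```{r eval=FALSE}
melanoma %>%
  count(status.factor, mort_5yr)
```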
### Plot the data
We are interested in the association between tumour ulceration and outcome (Figure \@ref(fig:chap09-fig-ulceration)).
```{r chap09-fig-ulceration, fig.height=3, fig.width=5, fig.cap = "Exploring ulceration and outcome (5-year mortality)."}
p1 <- melanoma %>%
ggplot(aes(x = ulcer.factor, fill = mort_5yr)) +
geom_bar() +
theme(legend.position = "none")
p2 <- melanoma %>%
ggplot(aes(x = ulcer.factor, fill = mort_5yr)) +
geom_bar(position = "fill") +
ylab("proportion")
library(patchwork)
p1 + p2
```
As we might have anticipated from our work in the previous chapter, 5-year mortality is higher in patients with ulcerated tumours compared with those with non-ulcerated tumours.
We are also interested in other variables that may be associated with tumour ulceration.
If they are also associated with our outcome, then they will confound the estimate of the direct effect of tumour ulceration.
We can plot out these relationships, or tabulate them instead.
### Tabulate data
We will use the convenient `summary_factorlist()` function from the `finalfit` package to look for differences across other variables by tumour ulceration.
```{r, eval=FALSE}
library(finalfit)
dependent <- "ulcer.factor"
explanatory <- c("age", "sex.factor", "year", "t_stage.factor")
melanoma %>%
summary_factorlist(dependent, explanatory, p = TRUE,
add_dependent_label = TRUE)
```
```{r, echo=FALSE}
library(finalfit)
dependent <- "ulcer.factor"
explanatory <- c("age", "sex.factor", "year", "t_stage.factor")
melanoma %>%
summary_factorlist(dependent, explanatory, p = TRUE,
column = TRUE, # proportions by column
add_dependent_label = TRUE) %>%
mykable(caption="Multiple variables by explanatory variable of interest: Malignant melanoma ulceration by patient and disease variables.") %>%
column_spec(1, width = "3.5cm")
```
It appears that patients with ulcerated tumours were older, more likely to be male, and had thicker/higher stage tumours.
It is important therefore to consider inclusion of these variables in a regression model.
## Model assumptions
\index{logistic regression@\textbf{logistic regression}!assumptions}
Binary logistic regression is robust to many of the assumptions which cause problems in other statistical analyses.
The main assumptions are:
1. Binary dependent variable - this is obvious, but as above we need to check (a three-level outcome of alive, died from disease, and died from other causes does not work without recoding);
2. Independence of observations - the observations should not be repeated measurements or matched data;
3. Linearity of continuous explanatory variables and the log-odds outcome - take age as an example. If the outcome, say death, gets more frequent or less frequent as age rises, the model will work well. However, say children and the elderly are at high risk of death, but those in middle years are not, then the relationship is not linear. Or more correctly, it is not monotonic, meaning that the response does not only go in one direction;
4. No multicollinearity - explanatory variables should not be highly correlated with each other.
### Linearity of continuous variables to the response
\index{logistic regression@\textbf{logistic regression}!loess}
A graphical check of linearity can be performed using a best fit "loess" line.
This is on the probability scale, so it is not going to be straight.
But it should be monotonic - it should only ever go up or down.
```{r chap09-fig-loess, fig.height=3, fig.width=5.5, message=FALSE, warning=FALSE, fig.cap = "Linearity of our continuous explanatory variables to the outcome (5-year mortality)."}
melanoma %>%
mutate(
mort_5yr.num = as.numeric(mort_5yr) - 1
) %>%
select(mort_5yr.num, age, year) %>%
pivot_longer(all_of(c("age", "year")), names_to = "predictors") %>%
ggplot(aes(x = value, y = mort_5yr.num)) +
geom_point(size = 0.5, alpha = 0.5) +
geom_smooth(method = "loess") +
facet_wrap(~predictors, scales = "free_x")
```
Figure \@ref(fig:chap09-fig-loess) shows that age is interesting as the relationship is u-shaped.
The chance of death is higher in the young and the old compared with the middle-aged.
This will need to be accounted for in any model including age as a predictor.
### Multicollinearity{#chap09-h2-multicollinearity}
\index{logistic regression@\textbf{logistic regression}!multicollinearity}
\index{collinearity}
\index{multicollinearity}
The presence of two or more highly correlated variables in a regression analysis can cause problems in the results which are generated.
The slopes of lines (coefficients, ORs) can become unstable, which means big shifts in their size with minimal changes to the model or the underlying data.
The confidence intervals around these coefficients may also be large.
Definitions of the specifics differ between sources, but there are broadly two situations.
The first is when two highly correlated variables have been included in a model, sometimes referred to simply as collinearity.
This can be detected by thinking about which variables may be correlated, and then checking using plotting.
The second situation is more devious.
It is where collinearity exists between three or more variables, even when no pair of variables is particularly highly correlated.
To detect this, we can use a specific metric called the *variance inflation factor*.
As always though, think clearly about your data and whether there may be duplication of information.
Have you included a variable which is calculated from other variables already included in the model?
Including body mass index (BMI), weight and height would be problematic, given the first is calculated from the latter two.
Are you describing a similar piece of information in two different ways?
For instance, all perforated colon cancers are staged T4, so do you include T-stage and the perforation factor?
(Note, not all T4 cancers have perforated.)
The `ggpairs()` function from `library(GGally)` is a good way of visualising all two-way associations (Figure \@ref(fig:chap09-fig-ggpairs)).
```{r chap09-fig-ggpairs, fig.height=6, fig.width=6, message=FALSE, fig.cap = "Exploring two-way associations within our explanatory variables."}
library(GGally)
explanatory <- c("ulcer.factor", "age", "sex.factor",
"year", "t_stage.factor")
melanoma %>%
remove_labels() %>% # ggpairs doesn't work well with labels
ggpairs(columns = explanatory)
```
\index{functions@\textbf{functions}!ggpairs}
If you have many variables you want to check you can split them up.
**Continuous to continuous**
Here we're using the same `library(GGally)` code as above, but shortlisting the two continuous variables: age and year (Figure \@ref(fig:chap09-fig-contcont)):
```{r chap09-fig-contcont, fig.width=4, fig.height=4, fig.cap = "Exploring relationships between continuous variables."}
select_explanatory <- c("age", "year")
melanoma %>%
remove_labels() %>%
ggpairs(columns = select_explanatory)
```
**Continuous to categorical**
Let's use a clever `pivot_longer()` and `facet_wrap()` combination to efficiently plot multiple variables against each other without using `ggpairs()`.
We want to compare everything against, for example, age so we need to include `-age` in the `pivot_longer()` call so it doesn't get lumped up with everything else (Figure \@ref(fig:chap09-fig-contcat)):
```{r chap09-fig-contcat, fig.height=3, fig.width=5, message=FALSE, warning=FALSE, fig.cap = "Exploring associations between continuous and categorical explanatory variables."}
select_explanatory <- c("age", "ulcer.factor",
"sex.factor", "t_stage.factor")
melanoma %>%
select(all_of(select_explanatory)) %>%
pivot_longer(-age) %>% # pivots all but age into two columns: name and value
ggplot(aes(value, age)) +
geom_boxplot() +
facet_wrap(~name, scale = "free", ncol = 3) +
coord_flip()
```
**Categorical to categorical**
```{r fig.height=3, fig.width=5, warning=FALSE}
select_explanatory <- c("ulcer.factor", "sex.factor", "t_stage.factor")
melanoma %>%
select(one_of(select_explanatory)) %>%
pivot_longer(-sex.factor) %>%
ggplot(aes(value, fill = sex.factor)) +
geom_bar(position = "fill") +
ylab("proportion") +
facet_wrap(~name, scale = "free", ncol = 2) +
coord_flip()
```
None of the explanatory variables are highly correlated with one another.
**Variance inflation factor**
\index{functions@\textbf{functions}!vif}
Finally, as a check for the presence of higher-order correlations, the variance inflation factor can be calculated for each of the terms in a final model.
In simple language, this is a measure of how much the variance of a particular regression coefficient is increased due to the presence of multicollinearity in the model.
Here is an example.
*GVIF* stands for generalised variance inflation factor.
A common rule of thumb is that if this is greater than 5-10 for any variable, then multicollinearity may exist.
The model should be further explored and the terms removed or reduced.
```{r}
dependent <- "mort_5yr"
explanatory <- c("ulcer.factor", "age", "sex.factor",
"year", "t_stage.factor")
melanoma %>%
glmmulti(dependent, explanatory) %>%
car::vif()
```
We are not trying to over-egg this, but multicollinearity can be important.
The message as always is the same.
Understand the underlying data using plotting and tables, and you are unlikely to come unstuck.
## Fitting logistic regression models in base R
\index{logistic regression@\textbf{logistic regression}!model fitting}
`glm()` stands for generalised linear model and is the standard base R approach to logistic regression.
The `glm()` function has several options and many different types of model can be run.
For instance, 'Poisson regression' for count data.
To run binary logistic regression use `family = binomial`.
This defaults to `family = binomial(link = 'logit')`.
Other link functions exist, such as the `probit` function, but this makes little difference to final conclusions.
Let's start with a simple univariable model using the classical R approach.
```{r}
fit1 <- glm(mort_5yr ~ ulcer.factor, data = melanoma, family = binomial)
summary(fit1)
```
\index{functions@\textbf{functions}!glm}
This is the standard R output which you should become familiar with.
It is included in the previous figures.
The estimates of the coefficients (slopes) in this output are on the log-odds scale and always will be.
Easier approaches for doing this in practice are shown below, but for completeness here we will show how to extract the results.
`str()` shows all the information included in the model object, which is useful for experts but a bit off-putting if you are starting out.
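For a gentler overview, limiting `str()` to the top level lists the components of the model object without printing their contents:
```{r eval=FALSE}
str(fit1, max.level = 1)  # top-level components of the glm object only
```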
The coefficients and their 95% confidence intervals can be extracted and exponentiated like this.
```{r}
coef(fit1) %>% exp()
confint(fit1) %>% exp()
```
\index{functions@\textbf{functions}!coef}
\index{functions@\textbf{functions}!confint}
Note that the 95% confidence interval lies between the 2.5% and 97.5% quantiles of the distribution, which is why the output is labelled in this way.
A good alternative is the `tidy()` function from the **broom** package.
```{r}
library(broom)
fit1 %>%
  tidy(conf.int = TRUE, exponentiate = TRUE)
```
We can see from these results that there is a strong association between tumour ulceration and 5-year mortality (OR 6.68, 95%CI 3.18, 15.18).
Model metrics can be extracted using the `glance()` function.
```{r}
fit1 %>%
glance()
```
## Modelling strategy for binary outcomes
\index{logistic regression@\textbf{logistic regression}!model fitting principles}
A statistical model is a tool to understand the world.
The better your model describes your data, the more useful it will be.
Fitting a successful statistical model requires decisions around which variables to include in the model.
Our advice regarding variable selection follows the same lines as in the linear regression chapter.
1. As few explanatory variables should be used as possible (parsimony);
2. Explanatory variables associated with the outcome variable in previous studies should be accounted for;
3. Demographic variables should be included in model exploration;
4. Population stratification should be incorporated if available;
5. Interactions should be checked and included if influential;
6. Final model selection should be performed using a "criterion-based approach"
+ minimise the Akaike information criterion (AIC)
+ maximise the c-statistic (area under the receiver operator curve).
We will use these principles through the next section.
## Fitting logistic regression models with finalfit
Our preference in model fitting is now to use our own **finalfit** package.
It gets us to our results quicker and more easily, and produces our final model tables which go directly into manuscripts for publication (we hope).
The approach is the same as in linear regression.
If the outcome variable is correctly specified as a factor, the `finalfit()` function will run a logistic regression model directly.
```{r eval=FALSE}
library(finalfit)
dependent <- "mort_5yr"
explanatory <- "ulcer.factor"
melanoma %>%
finalfit(dependent, explanatory, metrics = TRUE)
```
```{r echo=FALSE, message=FALSE}
library(finalfit)
dependent <- "mort_5yr"
explanatory <- "ulcer.factor"
fit <- melanoma %>%
finalfit(dependent, explanatory, metrics = TRUE)
fit[[1]] %>% mykable(caption = "Univariable logistic regression: 5-year survival from malignant melanoma by tumour ulceration (fit 1).") %>%
column_spec(1, width = "3.5cm")
fit[[2]] %>% mykable(caption = "Model metrics: 5-year survival from malignant melanoma by tumour ulceration (fit 1).", col.names = "") %>%
column_spec(1, width = "18cm")
```
### Criterion-based model fitting
Passing `metrics = TRUE` to `finalfit()` gives us a useful list of model fitting parameters.
We recommend looking at three metrics:
* Akaike information criterion (AIC), which should be minimised,
* C-statistic (area under the receiver operator curve), which should be maximised;
* Hosmer–Lemeshow test, which should be non-significant.
\newpage
**AIC**
\index{logistic regression@\textbf{logistic regression}!AIC}
\index{AIC}
The AIC has been previously described (Section \@ref(chap07-aic)).
It provides a measure of model goodness-of-fit - or how well the model fits the available data.
It is penalised for each additional variable, so should be somewhat robust against over-fitting (when the model starts to describe noise).
**C-statistic**
\index{logistic regression@\textbf{logistic regression}!C-statistic}
\index{C-statistic}
The c-statistic or area under the receiver operator curve (ROC) provides a measure of model 'discrimination'.
It runs from 0.5 to 1.0, with 0.5 being no better than chance and 1.0 being perfect discrimination.
What the number actually represents can be thought of like this.
Take our example of death from melanoma.
If you take a random patient who died and a random patient who did not die, then the c-statistic is the probability that the model predicts that patient 1 is more likely to die than patient 2.
In our example above, the model should get that correct 72% of the time.
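Should you wish to compute this directly for a given model, one option (assuming the **pROC** package is installed) is to build the ROC curve from the model's predicted probabilities; a minimal sketch using `fit1` from above:
```{r eval=FALSE}
library(pROC)
pred <- predict(fit1, type = "response")   # predicted probabilities
roc_fit1 <- roc(melanoma$mort_5yr, pred)   # build the ROC curve
auc(roc_fit1)                              # c-statistic / area under the ROC curve
```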
**Hosmer-Lemeshow test**
\index{logistic regression@\textbf{logistic regression}!Hosmer-Lemeshow test}
\index{Hosmer-Lemeshow test}
If you are interested in using your model for prediction, it is important that it is calibrated correctly.
Using our example, calibration means that the model accurately predicts death from melanoma when the risk to the patient is low and also accurately predicts death when the risk is high.
The model should work well across the range of probabilities of death.
The Hosmer-Lemeshow test assesses this.
By default, it assesses the predictive accuracy for death in deciles of risk.
If the model predicts equally well (or badly) at low probabilities compared with high probabilities, there will be no evidence of a difference between observed and predicted outcomes and the p-value will be non-significant.
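The test itself can also be run directly on a fitted model; one commonly used implementation is `hoslem.test()` from the **ResourceSelection** package (assuming it is installed), which compares observed and predicted outcomes across `g = 10` groups of risk:
```{r eval=FALSE}
library(ResourceSelection)
hoslem.test(as.numeric(melanoma$mort_5yr) - 1,   # observed outcome as 0/1
            predict(fit1, type = "response"),    # predicted probabilities
            g = 10)                              # deciles of risk
```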
## Model fitting{#chap09-model-fitting}
\index{logistic regression@\textbf{logistic regression}!model fitting}
\index{logistic regression@\textbf{logistic regression}!finalfit}
Engage with the data and the results when model fitting.
Do not use automated processes - you have to keep thinking.
Three things are important to keep looking at:
* what is the association between a particular variable and the outcome (OR and 95%CI);
* how much information is a variable bringing to the model (change in AIC and c-statistic);
* how much influence does adding a variable have on the effect size of another variable, and in particular the variable of interest (a rule of thumb: if the OR of the variable of interest changes by more than 10% when a new variable is added to the model, the new variable is likely to be important).
We're going to start by including the variables from above which we think are relevant.
```{r eval=FALSE}
library(finalfit)
dependent <- "mort_5yr"
explanatory <- c("ulcer.factor", "age", "sex.factor", "t_stage.factor")
melanoma %>%
finalfit(dependent, explanatory, metrics = TRUE)
```
```{r echo=FALSE, message=FALSE}
library(finalfit)
dependent <- "mort_5yr"
explanatory <- c("ulcer.factor", "age", "sex.factor", "t_stage.factor")
fit <- melanoma %>%
finalfit(dependent, explanatory, metrics = TRUE)
fit[[1]] %>% mykable(caption = "Multivariable logistic regression: 5-year survival from malignant melanoma (fit 2).") %>%
column_spec(1, width = "3.5cm")
fit[[2]] %>% mykable(caption = "Model metrics: 5-year survival from malignant melanoma (fit 2).", col.names = "") %>%
column_spec(1, width = "18cm")
```
The model metrics have improved with the AIC decreasing from 192 to 188 and the c-statistic increasing from 0.717 to 0.798.
Let's consider `age`.
We may expect age to be associated with the outcome because it so commonly is.
But there is weak evidence of an association in the univariable analysis.
We have shown above that the relationship of age to the outcome is not linear, therefore we need to act on this.
We can either convert age to a categorical variable or include it with a quadratic term ($x^2 + x$, remember parabolas from school?).
```{r eval=FALSE}
melanoma <- melanoma %>%
mutate(
age.factor = cut(age,
breaks = c(0, 25, 50, 75, 100)) %>%
ff_label("Age (years)"))
# Add this to relevel:
# fct_relevel("(50,75]")
melanoma %>%
finalfit(dependent, c("ulcer.factor", "age.factor"), metrics = TRUE)
```
```{r echo=FALSE, message=FALSE}
melanoma <- melanoma %>%
mutate(
age.factor = cut(age,
breaks = c(0, 25, 50, 75, 100)) %>%
ff_label("Age (years)"))
# Add this to relevel:
# fct_relevel("(50,75]")
fit <- melanoma %>%
finalfit(dependent, c("ulcer.factor", "age.factor"), metrics = TRUE)
fit[[1]] %>% mykable(caption = "Multivariable logistic regression: using cut to convert a continuous variable as a factor (fit 3).") %>%
column_spec(1, width = "3.5cm")
fit[[2]] %>% mykable(caption = "Model metrics: using `cut` to convert a continuous variable as a factor (fit 3).", col.names = "") %>%
column_spec(1, width = "18cm")
```
There is no strong relationship between the categorical representation of age and the outcome.
Let's try a quadratic term.
In base R, a quadratic term is added like this.
```{r}
glm(mort_5yr ~ ulcer.factor + I(age^2) + age,
data = melanoma, family = binomial) %>%
summary()
```
\index{functions@\textbf{functions}!glm}
It can be done with **finalfit** in a similar manner.
Note that with the default univariable model settings, the quadratic and linear terms are considered in separate models, which doesn't make much sense.
```{r eval=FALSE}
library(finalfit)
dependent <- "mort_5yr"
explanatory <- c("ulcer.factor", "I(age^2)", "age")
melanoma %>%
finalfit(dependent, explanatory, metrics = TRUE)
```
```{r echo=FALSE, message=FALSE}
library(finalfit)
dependent <- "mort_5yr"
explanatory <- c("ulcer.factor", "I(age^2)", "age")
fit <- melanoma %>%
finalfit(dependent, explanatory, metrics = TRUE)
fit[[1]] %>% mykable(caption = "Multivariable logistic regression: including a quadratic term (fit 4).") %>%
column_spec(1, width = "3.5cm")
fit[[2]] %>% mykable(caption = "Model metrics: including a quadratic term (fit 4).", col.names = "") %>%
column_spec(1, width = "18cm")
```
The AIC is worse when adding age either as a factor or with a quadratic term to the base model.
One final method to visualise the contribution of a particular variable is to remove it from the full model.
This is convenient in **finalfit**.
```{r eval=FALSE}
library(finalfit)
dependent <- "mort_5yr"
explanatory <- c("ulcer.factor", "age.factor", "sex.factor", "t_stage.factor")
explanatory_multi <- c("ulcer.factor", "sex.factor", "t_stage.factor")
melanoma %>%
finalfit(dependent, explanatory, explanatory_multi,
keep_models = TRUE, metrics = TRUE)
```
```{r echo=FALSE, message=FALSE}
library(finalfit)
dependent <- "mort_5yr"
explanatory <- c("ulcer.factor", "age.factor", "sex.factor", "t_stage.factor")
explanatory_multi <- c("ulcer.factor", "sex.factor", "t_stage.factor")
fit <- melanoma %>%
finalfit(dependent, explanatory, explanatory_multi,
keep_models = TRUE, metrics = TRUE)
fit[[1]] %>% mykable(caption = "Multivariable logistic regression model: comparing a reduced model in one table (fit 5).") %>%
column_spec(1, width = "3.5cm")
fit[[2]] %>% unlist() %>%
mykable(caption = "Model metrics: comparing a reduced model in one table (fit 5).", col.names = "") %>%
column_spec(1, width = "18cm")
```
The AIC improves when age is removed (186 from 190) at only a small loss in discrimination (0.794 from 0.802).
Looking at the model table and comparing the full multivariable with the reduced multivariable, there has been a small change in the OR for ulceration, with some of the variation accounted for by age now being taken up by ulceration.
This is to be expected, given the association (albeit weak) that we saw earlier between age and ulceration.
Given all this, we will decide not to include age in the model.
Now, what about the variable sex?
It has a significant association with the outcome in the univariable analysis, but much of this is explained by other variables in multivariable analysis.
Is it contributing much to the model?
```{r eval=FALSE}
library(finalfit)
dependent <- "mort_5yr"
explanatory <- c("ulcer.factor", "sex.factor", "t_stage.factor")
explanatory_multi <- c("ulcer.factor", "t_stage.factor")
melanoma %>%
finalfit(dependent, explanatory, explanatory_multi,
keep_models = TRUE, metrics = TRUE)
```
```{r echo=FALSE, message=FALSE}
library(finalfit)
dependent <- "mort_5yr"
explanatory <- c("ulcer.factor", "sex.factor", "t_stage.factor")
explanatory_multi <- c("ulcer.factor", "t_stage.factor")
fit <- melanoma %>%
finalfit(dependent, explanatory, explanatory_multi,
keep_models = TRUE, metrics = TRUE)
fit[[1]] %>% mykable(caption = "Multivariable logistic regression: further reducing the model (fit 6).") %>%
column_spec(1, width = "3.5cm")
fit[[2]] %>% unlist() %>% mykable(caption = "Model metrics: further reducing the model (fit 6).", col.names = "") %>%
column_spec(1, width = "18cm")
```
By removing sex we have improved the AIC a little (184.4 from 186.1) with a small change in the c-statistic (0.791 from 0.794).
Looking at the model table, the variation has been taken up mostly by stage 4 disease and a little by ulceration.
But there has been little change overall.
We will exclude sex from our final model as well.
As a final step, we can check for a first-order interaction between ulceration and T-stage.
Just to remind us what this means, a significant interaction would mean the effect of, say, ulceration on 5-year mortality would differ by T-stage.
For instance, perhaps the presence of ulceration confers a much greater risk of death in advanced deep tumours compared with earlier superficial tumours.
```{r eval=FALSE}
library(finalfit)
dependent <- "mort_5yr"
explanatory <- c("ulcer.factor", "t_stage.factor")
explanatory_multi <- c("ulcer.factor*t_stage.factor")
melanoma %>%
finalfit(dependent, explanatory, explanatory_multi,
keep_models = TRUE, metrics = TRUE)
```
```{r echo=FALSE, message=FALSE}
library(finalfit)
dependent <- "mort_5yr"
explanatory <- c("ulcer.factor", "t_stage.factor")
explanatory_multi <- c("ulcer.factor*t_stage.factor")
melanoma %>%
finalfit(dependent, explanatory, explanatory_multi,
keep_models = TRUE, add_dependent_label = FALSE) %>%
mutate(
label = c("Ulcerated tumour",
"",
"T-stage",
"",
"",
"",
"UlcerPresent:T2",
"UlcerPresent:T3",
"UlcerPresent:T4"),
`OR (univariable)` = factor(`OR (univariable)`) %>%
fct_explicit_na(na_level = "-"),
`OR (multivariable)` = factor(`OR (multivariable)`) %>%
fct_explicit_na(na_level = "-")
) %>%
#na.omit() %>%
mykable(caption = "Multivariable logistic regression: including an interaction term (fit 7).") %>%
column_spec(1, width = "3.5cm")
```
There are no significant interaction terms.
Our final model table is therefore:
```{r eval=FALSE}
library(finalfit)
dependent <- "mort_5yr"
explanatory <- c("ulcer.factor", "age.factor",
"sex.factor", "t_stage.factor")
explanatory_multi <- c("ulcer.factor", "t_stage.factor")
melanoma %>%
finalfit(dependent, explanatory, explanatory_multi, metrics = TRUE)
```
```{r echo=FALSE, message=FALSE}
library(finalfit)
dependent <- "mort_5yr"
explanatory <- c("ulcer.factor", "age.factor", "sex.factor", "t_stage.factor")
explanatory_multi <- c("ulcer.factor", "t_stage.factor")
fit <- melanoma %>%
finalfit(dependent, explanatory, explanatory_multi, metrics = TRUE)
fit[[1]] %>% mykable(caption = "Multivariable logistic regression: final model (fit 8).") %>%
column_spec(1, width = "3.5cm")
fit[[2]] %>% mykable(caption = "Model metrics: final model (fit 8).", col.names = "") %>%
column_spec(1, width = "18cm")
```
### Odds ratio plot
\index{logistic regression@\textbf{logistic regression}!odds ratio plot}
\index{plotting@\textbf{plotting}!odds ratio}
```{r fig.height=3, fig.width=7, message=FALSE, warning=FALSE, fig.cap="Odds ratio plot."}
dependent <- "mort_5yr"
explanatory_multi <- c("ulcer.factor", "t_stage.factor")
melanoma %>%
or_plot(dependent, explanatory_multi,
breaks = c(0.5, 1, 2, 5, 10, 25),
table_text_size = 3.5,
title_text_size = 16)
```
\index{functions@\textbf{functions}!or\_plot}
\index{plotting@\textbf{plotting}!or\_plot}
We can conclude that there is evidence of an association between tumour ulceration and 5-year survival which is independent of the tumour depth as captured by T-stage.
## Correlated groups of observations
\index{logistic regression@\textbf{logistic regression}!correlated groups}
\index{logistic regression@\textbf{logistic regression}!mixed effects}
\index{logistic regression@\textbf{logistic regression}!random effects}
\index{logistic regression@\textbf{logistic regression}!multilevel}
\index{logistic regression@\textbf{logistic regression}!hierarchical}
In our modelling strategy above, we mentioned the incorporation of population stratification if available.
What does this mean?
Our regression is seeking to capture the characteristics of particular patients.
These characteristics are made manifest through the slopes of fitted lines - the estimated coefficients (ORs) of particular variables.
A goal is to estimate these characteristics as precisely as possible.
Bias can be introduced when correlations between patients are not accounted for.
Correlations may be as simple as being treated within the same hospital.
By virtue of this fact, these patients may have commonalities that have not been captured by the observed variables.
Population characteristics can be incorporated into our models.
We may not be interested in capturing and measuring the effects themselves, but want to ensure they are accounted for in the analysis.
One approach is to include grouping variables as `random effects`.
These may be nested with each other, for example patients within hospitals within countries.
These are added in addition to the `fixed effects` we have been dealing with up until now.
These models go under different names including mixed effects model, multilevel model, or hierarchical model.
Other approaches, such as generalised estimating equations, are not dealt with here.
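To give a flavour of what such a model looks like in code, a random-intercept logistic regression can be fitted with `glmer()` from the **lme4** package. The sketch below assumes a hypothetical `hospital` grouping variable, which the melanoma data do not contain (hence the simulation in the next section):
```{r eval=FALSE}
library(lme4)
# `hospital` is a hypothetical grouping variable used only for illustration.
# (1 | hospital) adds a random intercept for each hospital to the fixed effects.
glmer(mort_5yr ~ ulcer.factor + t_stage.factor + (1 | hospital),
      data = melanoma, family = binomial)
```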
### Simulate data
Our melanoma dataset doesn't include any higher level structure, so we will simulate this for demonstration purposes.