# Logistic Regression
Logistic regression is regression with a binary response.

In short, regression was developed for continuous responses, and this is what is supported in R by the *lm()* function. However, there exist other types of responses, including binary, counts, etc., where the aim of the statistical analysis is still to use a regression type of model. This has given rise to generalized linear models, where several response types can be analyzed in a regression framework. In R this is done using the *glm()* function, extended with a specification of the response type via the _family_ argument.

Logistic regression models the probability of observing a case (versus a control) given the input. The outputs from such a model are odds ratios.
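As a minimal sketch (with a hypothetical data frame `d` containing a binary outcome `y` and a predictor `x`), the only change compared to ordinary regression is the _family_ argument:

```{r, eval=FALSE}
# Hypothetical data frame 'd' with binary outcome 'y' and predictor 'x'
# Ordinary linear regression (continuous response):
lm(y ~ x, data = d)
# Logistic regression (binary response) -- same syntax, plus family = binomial:
glm(y ~ x, data = d, family = binomial)
```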
This video takes you through some basics. If you want to dive directly into logistic regression, start at *6:53*.
```{r, echo=FALSE}
# Note: vembedr::embed_youtube() expects the video ID, not the full URL
vembedr::embed_youtube("C4N3_XJJ-jU")
# Alternative videos:
# vembedr::embed_youtube("DBqdQekpR5M")
# vembedr::embed_youtube("SOYt84ZPTx0")
```
In this chapter, we are going to use logistic regression for characterizing clusters. This is simply done by constructing binary classes from the clusters: **belongs to cluster k** versus **does not belong to cluster k**. Even with more than two clusters, binary endpoints can be constructed in this way. Alternatively, if one of the clusters naturally serves as a reference cluster, then comparing each of the other clusters to this one is also a way forward.
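As a minimal sketch (with a hypothetical data frame `d` holding a cluster label in the column `cluster`), such a binary endpoint can be constructed as:

```{r, eval=FALSE}
# Hypothetical: binary endpoint for cluster k (here k = 1);
# 1 = belongs to cluster 1, 0 = belongs to any other cluster
d$incluster1 <- as.numeric(d$cluster == 1)
```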
The data used in this chapter is from the paper: *Verbeke, Wim, Federico JA Pérez-Cueto, and Klaus G. Grunert. "To eat or not to eat pork, how frequently and how varied? Insights from the quantitative Q-PorkChains consumer survey in four European countries." Meat Science 88.4 (2011): 619-626.* and can be found in the *data4consumerscience*-package as *pork*.
```{r}
library(data4consumerscience)
data(pork)
```
## Segmentation/Clustering
In the dataset there are three clusters encoded as binary membership variables: cluster1, cluster2 and cluster3. They are mutually exclusive.
These clusters are based on a clustering procedure other than k-means, but in principle any clustering can be used to generate clusters... as long as they are meaningful.
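A quick check that the three indicators really are mutually exclusive: the row-wise sum should be $1$ for every classified respondent (a minimal sketch):

```{r, eval=FALSE}
# Every classified respondent should belong to exactly one cluster
table(pork$cluster1 + pork$cluster2 + pork$cluster3, useNA = 'ifany')
```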
Here we just plot them to see how they appear on the two variables used for their construction.
```{r, fig.height=3.5, fig.width=4.5}
library(ggplot2)
library(tableone)

# Combine the three binary membership variables into one cluster label (1, 2 or 3)
pork$clusters <- pork$cluster1*1 +
  pork$cluster2*2 +
  pork$cluster3*3

plot(TotalPorkWeek ~ VarietyTotal, data = pork,
     col = pork[["clusters"]] + 1)

# Or using ggplot2:
ggplot(data = pork, aes(x = VarietyTotal, y = TotalPorkWeek, color = factor(clusters))) +
  geom_point() +
  guides(color = guide_legend('Cluster')) +
  theme_linedraw()
```
```{r, fig.height=3.5, fig.width=4.5, include=FALSE, echo=FALSE}
# 5 points are missing a classification
which(is.na(pork$clusters), arr.ind = TRUE)
tb1 <- CreateTableOne(data = pork,
                      strata = 'clusters',
                      vars = c('COUNTRY', 'Locality', 'Gender', 'Age', 'education'))
print(tb1)
```
## Fitting the logistic regression model

The question we are going to address here is: _is gender related to being in cluster 1?_ Other tools for this are covered in [Profiling segments], but here we will see how logistic regression can be used.
```{r}
# Gender is encoded as 1: Male, 2: Female
pork[["Gender"]] <- as.factor(pork[["Gender"]])

## Fitting the logistic model
logreg.1.gender <- glm(cluster1 ~ Gender, data = pork, family = binomial)
summary(logreg.1.gender)
```
This model indicates that Gender is related to cluster1, as the p-value is highly significant.
## Probabilities of segment membership
One way to get the direction of the effect is to interpret the coefficients from the summary() output, but alternatively, simply using the model to predict membership probabilities gives similar insight.

Here the probability of belonging to cluster 1 is predicted for the situation where Gender is $1$ as well as $2$. Be aware that such predictions should match the data source used for building the model. Here that means that if Gender is encoded with numbers (1 or 2), then the predictions are likewise done using those. If Gender is encoded as _male_ and _female_, then those are the inputs when doing predictions.
```{r}
predict(logreg.1.gender, data.frame(Gender = "1"),
type = "response")
predict(logreg.1.gender, data.frame(Gender = "2"),
type = "response")
```
What we see is that the probability of belonging to cluster1 is higher if Gender is $1$ (in reference to $2$). This is in line with the results in [Contingency table] and [Pearson Chi-square test] in the former chapter.
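As a sanity check, the predicted probabilities should match the observed proportions of cluster1 membership within each gender (a minimal sketch):

```{r, eval=FALSE}
# Observed proportion of cluster1 membership per gender -- compare with predict()
prop.table(table(pork$Gender, pork$cluster1), margin = 1)
```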
## Odds ratios
Odds ratio is the terminology used for interpreting coefficients from a logistic regression model.

The direct readout of coefficients from summary() returns the natural logarithm of the odds ratio, so to get odds ratios the exponential function is applied.
```{r}
# OR for Gender 2 versus Gender 1 (the reference level)
exp(coef(summary(logreg.1.gender))[2, 1])
# ... and its confidence interval
exp(confint(logreg.1.gender)[2, ])
```
Here it is seen that the odds of belonging to cluster1 favor Gender $1$: the estimated OR for Gender $2$ versus Gender $1$, as well as its confidence bounds, lies below $1$.
## ORs and Probs
The odds ratio for a binary predictor is given by:

$$ OR = \frac{p_1(1-p_2)}{(1-p_1)p_2}$$

where $p_1$ and $p_2$ are the predicted probabilities for females (Gender $2$) and males (Gender $1$), respectively.

Here we use the predicted probabilities for females and males (see *Probabilities of segment membership* above) to calculate the OR using the formula:
```{r, eval=FALSE, echo=TRUE}
p1 <- 0.27  # predicted probability for Gender 2 (females)
p2 <- 0.38  # predicted probability for Gender 1 (males)
OR <- (p1*(1 - p2)) / ((1 - p1)*p2)
OR
```
This is exactly the same as the exponential of the coefficient from the model.
## Effect of Age
Logistic regression can, like ordinary regression, also be used with continuous predictors, such as Age.
```{r}
head(pork[, c("Age", "cluster1")])
logreg.1.age <- glm(cluster1 ~ Age, data = pork, family = binomial)
# as log(OR)
coef(summary(logreg.1.age))
# and as OR (only for the Age)
exp(coef(summary(logreg.1.age))[2,1])
```
The coefficient indicates a very weak tendency for cluster1 to contain older individuals (the estimate is positive), but this result is far from significant.
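A quick visual impression of this (lack of) age difference can be obtained with a boxplot (a sketch):

```{r, eval=FALSE}
# Age distribution for members (1) and non-members (0) of cluster 1
ggplot(pork, aes(x = factor(cluster1), y = Age)) +
  geom_boxplot() +
  labs(x = 'Cluster 1 membership')
```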
## Multivariate analysis
Evaluating the effect of country while adjusting for age and gender.
### Descriptives
First we look at the crude percentages of cluster1 membership for the different countries.
```{r}
tb <- table(pork$COUNTRY, pork$cluster1)
prop.table(tb,1)
```
It seems as if especially country $4$ has a higher share in cluster 1 compared to the others.
### Two nested models
In order to investigate the effect of country, we make a model **with country** and a model **without country** (but apart from that identical!). Then we compare the drop in ability to fit the data using anova. This indicates whether country is an important factor.
```{r}
pork[["COUNTRY"]] <- as.factor(pork[["COUNTRY"]])

# Model with country
logreg.1.cntr_1 <- glm(cluster1 ~ COUNTRY + Age + Gender,
                       data = pork, family = binomial)

# Model without country
logreg.1.cntr_0 <- glm(cluster1 ~ Age + Gender,
                       data = pork, family = binomial)

# Comparison
anova(logreg.1.cntr_1, logreg.1.cntr_0, test = 'Chisq')

# Or in one line of code, after defining the model WITH country:
drop1(logreg.1.cntr_1, test = "Chisq")
```
So the effect of country is strongly significant.
### Coefficients
Let's look at how similar/different the levels of country are.
```{r}
coef(summary(logreg.1.cntr_1))
```
The estimate for the *first level* of country (1) is contained in the *(Intercept)*.
The estimates for the remaining three levels are all **contrasts** to the intercept, and hence to country 1.
So it seems as if countries 3, 4 and 6 are different from country 1, but we have no idea whether, say, country 3 is different from country 6, and so forth.
### Re-level
In order to get other pairs of contrasts, we can re-level the factor country and refit the model.
Here we re-level to the third level of country, which is country==4.
```{r}
# Make country 4 (the third level) the reference level
pork[['COUNTRY']] <- relevel(pork[['COUNTRY']], ref = '4')
logreg.1.cntr_1a <- glm(cluster1 ~ COUNTRY + Age + Gender,
                        data = pork, family = binomial)
coef(summary(logreg.1.cntr_1a))
```
This can be repeated, setting each level as the reference in turn. But there is a fairer and easier solution.
### All pairwise comparisons
We want to compare all pairs of the $4$ country levels. This is a multiple comparison task, and can be undertaken by the **glht()** function from the **multcomp** package.
```{r, message=FALSE}
library(multcomp)
# All pairwise (Tukey) comparisons between countries
cmp <- glht(logreg.1.cntr_1, linfct = mcp(COUNTRY = "Tukey"))
summary(cmp)
cld(cmp)
```
Here we get all $6$ pairwise comparisons. It appears that country 1 and 3 are not statistically different, nor are country 3 and 6 ($p>0.1$). However, country 1 and 6 do differ.

Does this agree with the percentages from the descriptive analysis?
## Segment 2 and 3
Use the code above to conduct the same analysis for segment/cluster 2 and 3 (a starter sketch follows the list), and reveal
* Gender differences
* Age effect
* Country differences
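A starter sketch for cluster 2, using the gender model from above (swap in cluster3 and the other predictors to complete the task):

```{r, eval=FALSE}
# Same analysis as for cluster 1, now with cluster 2 as the endpoint
logreg.2.gender <- glm(cluster2 ~ Gender, data = pork, family = binomial)
summary(logreg.2.gender)
exp(coef(summary(logreg.2.gender))[2, 1]) # odds ratio
```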
## A new set of data
The data **plantbaseddiet**, found in the **data4consumerscience**-package, contains data from the following study:

*Reipurth, Malou FS, Lasse Hørby, Charlotte G. Gregersen, Astrid Bonke, and Federico JA Perez Cueto. "Barriers and facilitators towards adopting a more plant-based diet in a sample of Danish consumers." Food Quality and Preference 73 (2019): 288-292.*

Here we use the clusters from the consumer segmentation analysis as provided in [Segmentation - another example].
```{r}
rm(list = ls())
library(data4consumerscience)
data("plantbaseddiet")

# k-means clustering on the three intake variables (4 clusters)
set.seed(123)
res <- kmeans(plantbaseddiet[, c('a_meat', 'a_dairy', 'a_eggs')], 4)
res$centers
plantbaseddiet$clusters <- factor(res$cluster,
                                  labels = c('High Meat', 'Low all', 'High Dairy', 'High All'))

# Add the clusters as binary columns
plantbaseddiet$cluster_highmeat <- as.numeric(plantbaseddiet$clusters == 'High Meat')
plantbaseddiet$cluster_lowall <- as.numeric(plantbaseddiet$clusters == 'Low all')
plantbaseddiet$cluster_highdairy <- as.numeric(plantbaseddiet$clusters == 'High Dairy')
plantbaseddiet$cluster_highall <- as.numeric(plantbaseddiet$clusters == 'High All')
```
## Logistic regression for demographic characterization
### Age
```{r}
library(ggplot2)
ggplot(data = plantbaseddiet, aes(clusters, age)) + geom_boxplot()

# Logistic model: membership of the 'High Meat' cluster as a function of age
mdl_age <- glm(clusters == 'High Meat' ~ age, data = plantbaseddiet, family = binomial)
# This is essentially the same as:
# mdl_age <- glm(cluster_highmeat ~ age, data = plantbaseddiet, family = binomial)
summary(mdl_age)
```
### Gender
```{r}
ggplot(data = plantbaseddiet, aes(clusters, fill = gender)) + geom_bar(position = 'dodge')

tb2 <- table(plantbaseddiet$clusters, plantbaseddiet$gender)
tb2
prop.table(tb2, margin = 1)

mdl_gender <- glm(clusters == 'High Meat' ~ gender, data = plantbaseddiet, family = binomial)
summary(mdl_gender)
# Coefficient as an odds ratio
exp(coef(summary(mdl_gender))[2, 1])
```
## TASK
* Try to exchange which cluster you are looking at to see the relation to age and gender (see the starter sketch below).
* Try to look at some other characteristics in the dataset.
* Can you compare clusters against the variables *a_meat*, *a_dairy* or *a_eggs* using logistic regression? Why should you be cautious with interpreting these results?
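A starter sketch for the first bullet, simply swapping in the 'Low all' cluster as the endpoint:

```{r, eval=FALSE}
# Same age model as above, but for the 'Low all' cluster
mdl_age_lowall <- glm(cluster_lowall ~ age, data = plantbaseddiet, family = binomial)
summary(mdl_age_lowall)
```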
## The tidyverse way
We want to reproduce the results from Table 4 in the paper.

In principle, this table is $11$ survey Likert-scale answers against $4$ clusters, i.e. $44$ logistic regression models. If you do this in the traditional way it takes up tons of code, and even small changes will be time-consuming to implement.
For instance, the first element of Table 4 is calculated as:
```{r}
mdl <- glm(clusters == 'High Dairy' ~ o_prepar + age + gender + factor(education),
           data = plantbaseddiet, family = binomial)
# Odds ratios and their confidence intervals
exp(coef(mdl))
exp(confint(mdl))
```
But we want them all!

So: tidyverse and broom to the rescue!
```{r}
library(tidyverse)
library(broom)

tb <- plantbaseddiet %>%
  mutate(id = 1) %>% # make a vector of 1's
  spread(clusters, id, fill = 0) %>% # distribute the clustering into 4 new columns of binary (1/0) cluster indicators
  pivot_longer(names_to = 'plant_plantbaseddiet', values_to = 'answ', o_prepar:o_family) %>% # long format for survey questions
  pivot_longer(names_to = 'cluster_type', values_to = 'cluster', `High Meat`:`High All`) %>% # long format for clusters
  group_by(plant_plantbaseddiet, cluster_type) %>% # loop over all questions and all clusters
  do(glm(data = ., cluster == 1 ~ answ + factor(education) + age + gender, family = binomial) %>%
       tidy(exponentiate = T, conf.int = T)) # perform logistic regression and tidy the output

tb %>% head()
```
This table (tb) has all the results, with estimates being ORs along with confidence intervals. However, there is *too much* information: for instance, the intercept as well as the covariates are reported. We can tidy it even more:
```{r}
tb4 <- tb %>%
  ungroup() %>%
  filter(term == 'answ') %>% # keep only the rows for the survey answers
  mutate(OR = paste(round(estimate, 2),
                    ' (', round(conf.low, 2), '-',
                    round(conf.high, 2), ')',
                    sep = '')) %>%
  dplyr::select(plant_plantbaseddiet, cluster_type, OR) %>%
  spread(cluster_type, OR)

tb4
```
Now this just needs to be exported to Excel (use e.g. export() from the **rio** package) and a bit of **love** to be ready for publication.
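A minimal sketch of the export step (the file name here is just an example):

```{r, eval=FALSE}
library(rio)
export(tb4, 'table4.xlsx') # rio infers the format from the file extension
```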
## Comment
Try to understand each of the commands in the pipeline, what they do and why they make sense.
You can put a hash-tag (#) before the pipe sign (%>%) to truncate the pipeline and see the structure of the output at that point, as sketched below.
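For instance, to inspect the output right after the spread() step, the remainder of the pipeline can be commented out (a sketch):

```{r, eval=FALSE}
plantbaseddiet %>%
  mutate(id = 1) %>%
  spread(clusters, id, fill = 0) # %>%  <- the hash-tag blocks the rest of the pipeline
  # pivot_longer(names_to = 'plant_plantbaseddiet', values_to = 'answ', o_prepar:o_family) %>%
  # ...
```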