# Factors {#factors .r4ds-section}
## Introduction {#introduction-9 .r4ds-section}
Functions and packages:
```{r setup,message = FALSE,cache=FALSE}
library("tidyverse")
```
The forcats package does not need to be loaded explicitly, since recent versions of the tidyverse package attach it.
## Creating factors {#creating-factors .r4ds-section}
`r no_exercises()`
## General Social Survey {#general-social-survey .r4ds-section}
### Exercise 15.3.1 {.unnumbered .exercise data-number="15.3.1"}
<div class="question">
Explore the distribution of `rincome` (reported income).
What makes the default bar chart hard to understand?
How could you improve the plot?
</div>
<div class="answer">
My first attempt is to use `geom_bar()` with the default settings.
```{r}
rincome_plot <-
  gss_cat %>%
  ggplot(aes(x = rincome)) +
  geom_bar()
rincome_plot
```
The problem with the default bar chart is that the labels overlap and are impossible to read.
I'll try changing the angle of the x-axis labels to vertical so that they will not overlap.
```{r}
rincome_plot +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
This is better because the labels no longer overlap, but they are still difficult to read because they are vertical.
I could try angling the labels at 45 degrees so that they are easier to read but still do not overlap.
```{r}
rincome_plot +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
But the solution I prefer for bar charts with long labels is to flip the axes, so that the bars are horizontal.
Then the category labels are also horizontal, and easy to read.
```{r}
rincome_plot +
  coord_flip()
```
Though more than asked for in this question, I could further improve this plot by

1. removing the "Not applicable" responses,
1. renaming "Lt \$1000" to "Less than \$1000",
1. using color to distinguish non-response categories ("Refused", "Don't know", and "No answer") from income levels ("Lt \$1000", ...),
1. adding meaningful y- and x-axis titles, and
1. formatting the counts axis labels to use commas.
```{r}
gss_cat %>%
  filter(!rincome %in% c("Not applicable")) %>%
  mutate(rincome = fct_recode(rincome,
    "Less than $1000" = "Lt $1000"
  )) %>%
  mutate(rincome_na = rincome %in% c("Refused", "Don't know", "No answer")) %>%
  ggplot(aes(x = rincome, fill = rincome_na)) +
  geom_bar() +
  coord_flip() +
  scale_y_continuous("Number of Respondents", labels = scales::comma) +
  scale_x_discrete("Respondent's Income") +
  scale_fill_manual(values = c("FALSE" = "black", "TRUE" = "gray")) +
  theme(legend.position = "none")
```
If I were only interested in non-missing responses, then I could drop all respondents who answered "Not applicable", "Refused", "Don't know", or "No answer".
```{r}
gss_cat %>%
  filter(!rincome %in% c("Not applicable", "Don't know", "No answer", "Refused")) %>%
  mutate(rincome = fct_recode(rincome,
    "Less than $1000" = "Lt $1000"
  )) %>%
  ggplot(aes(x = rincome)) +
  geom_bar() +
  coord_flip() +
  scale_y_continuous("Number of Respondents", labels = scales::comma) +
  scale_x_discrete("Respondent's Income")
```
A side effect of `coord_flip()` is that the label ordering on the x-axis, from lowest (top) to highest (bottom), is counterintuitive.
The next section introduces a function, `fct_reorder()`, which can help with this.
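As a minimal sketch of the ordering issue, the toy example below reverses the levels of a factor with `fct_rev()` from forcats. This is a different function from the `fct_reorder()` mentioned above, and the `income` factor is hypothetical, made up only to mimic how `rincome` stores its income levels from highest to lowest.

```{r}
library("forcats")

# Hypothetical toy factor (not from gss_cat): levels run from highest to
# lowest income, which is how `rincome` orders its income levels.
income <- factor(
  c("High", "Low", "Medium", "Low"),
  levels = c("High", "Medium", "Low")
)

# fct_rev() reverses the order of the levels; the values are unchanged.
income_rev <- fct_rev(income)
levels(income_rev)
```

Because a flipped bar chart draws the first level at the bottom, reversing the levels in this way would put the lowest income at the bottom and the highest at the top.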
</div>
### Exercise 15.3.2 {.unnumbered .exercise data-number="15.3.2"}
<div class="question">
What is the most common `relig` in this survey?
What’s the most common `partyid`?
</div>
<div class="answer">
The most common `relig` is "Protestant".
```{r}
gss_cat %>%
  count(relig) %>%
  arrange(desc(n)) %>%
  head(1)
```
The most common `partyid` is "Independent".
```{r}
gss_cat %>%
  count(partyid) %>%
  arrange(desc(n)) %>%
  head(1)
```
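The same "most common value" lookup can be written a little more compactly. The sketch below assumes dplyr 1.0.0 or later (for `slice_head()`) and uses a hypothetical toy tibble, `toy`, rather than `gss_cat`.

```{r}
library("dplyr")

# Hypothetical toy data standing in for a column such as `relig`.
toy <- tibble(answer = c("b", "a", "b", "c", "b", "a"))

top_answer <- toy %>%
  count(answer, sort = TRUE) %>% # sort = TRUE orders rows by descending n
  slice_head(n = 1)              # keep only the most frequent value
top_answer
```

Here `count(sort = TRUE)` replaces the separate `arrange(desc(n))` step.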
</div>
### Exercise 15.3.3 {.unnumbered .exercise data-number="15.3.3"}
<div class="question">
Which `relig` does `denom` (denomination) apply to?
How can you find out with a table?
How can you find out with a visualization?
</div>
<div class="answer">
```{r}
levels(gss_cat$denom)
```
From the context it is clear that `denom` refers to "Protestant" (unsurprising, given that it is the largest category in `relig`).
Let's filter out the responses that do not name a denomination ("No answer", "Other", "Don't know", "Not applicable", and "No denomination"), leaving only actual denominations.
After doing that, the only remaining `relig` value is "Protestant".
```{r}
gss_cat %>%
  filter(!denom %in% c(
    "No answer", "Other", "Don't know", "Not applicable",
    "No denomination"
  )) %>%
  count(relig)
```
This is also clear in a scatter plot of `relig` vs. `denom` where the size of each point is
proportional to the number of responses (since otherwise there would be overplotting).
```{r}
gss_cat %>%
  count(relig, denom) %>%
  ggplot(aes(x = relig, y = denom, size = n)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90))
```
</div>
## Modifying factor order {#modifying-factor-order .r4ds-section}
### Exercise 15.4.1 {.unnumbered .exercise data-number="15.4.1"}
<div class="question">
There are some suspiciously high numbers in `tvhours`.
Is the `mean` a good summary?
</div>
<div class="answer">
```{r}
summary(gss_cat[["tvhours"]])
```
```{r}
gss_cat %>%
  filter(!is.na(tvhours)) %>%
  ggplot(aes(x = tvhours)) +
  geom_histogram(binwidth = 1)
```
Whether the mean is the best summary depends on what you are using it for :-), i.e. your objective.
But the distribution is right-skewed, so the median is probably what most people would prefer.
And the hours of TV don't look that surprising to me.
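To illustrate why the median is often preferred for skewed data, here is a minimal sketch with a made-up vector. The values in `tv` are hypothetical and are not taken from `gss_cat`.

```{r}
# Hypothetical right-skewed sample of daily TV hours with one extreme value.
tv <- c(0, 1, 2, 2, 3, 24)

mean(tv)   # pulled upward by the single extreme value
median(tv) # depends only on the middle of the sorted values
```

A single very large value shifts the mean noticeably but leaves the median where it was.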
</div>
### Exercise 15.4.2 {.unnumbered .exercise data-number="15.4.2"}
<div class="question">
For each factor in `gss_cat` identify whether the order of the levels is arbitrary or principled.
</div>
<div class="answer">
The following piece of code uses functions introduced in Ch 21 to print out the names of only the factors.
```{r}
keep(gss_cat, is.factor) %>% names()
```
There are six categorical variables: `marital`, `race`, `rincome`, `partyid`, `relig`, and `denom`.
The ordering of `marital` is "somewhat principled". There is some sort of logic
in that the levels are grouped "never married", married at some point
(separated, divorced, widowed), and "married"; though it would seem that "Never
Married", "Divorced", "Widowed", "Separated", "Married" might be more natural.
I find that the question of ordering can be determined by the level of
aggregation in a categorical variable, and there can be more "partially
ordered" factors than one would expect.
```{r}
levels(gss_cat[["marital"]])
```
```{r}
gss_cat %>%
  ggplot(aes(x = marital)) +
  geom_bar()
```
The ordering of `race` is principled in that the categories are ordered by the count of observations in the data (in increasing order, apart from the unused "Not applicable" level).
```{r}
levels(gss_cat$race)
```
```{r}
gss_cat %>%
  ggplot(aes(race)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE)
```
The levels of `rincome` are listed in decreasing order of income; however,
the placement of "No answer", "Don't know", and "Refused" before, and "Not
applicable" after, the income levels is arbitrary. It would be better to place
all the missing income level categories either before or after all the known
values.
```{r}
levels(gss_cat$rincome)
```
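One way to group the missing-value categories together is `fct_relevel()` with `after = Inf`, which moves the named levels to the end. The sketch below uses a hypothetical toy factor, `income`, laid out like `rincome`; it is an illustration, not a change to `gss_cat`.

```{r}
library("forcats")

# Hypothetical toy factor: non-response levels first, income levels in the
# middle, and "Not applicable" last, mirroring the layout of `rincome`.
income <- factor(
  c("High", "Refused", "Low", "Not applicable", "No answer"),
  levels = c("No answer", "Refused", "High", "Low", "Not applicable")
)

# Move the non-response levels to the end, next to "Not applicable".
income_tidy <- fct_relevel(income, "No answer", "Refused", after = Inf)
levels(income_tidy)
```

After this step all the known income levels come first, followed by every missing-value category.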
The levels of `relig` are arbitrary: there is no natural ordering, and they don't appear to be ordered by any statistic within the dataset.
```{r}
levels(gss_cat$relig)
```
```{r}
gss_cat %>%
  ggplot(aes(relig)) +
  geom_bar() +
  coord_flip()
```
The same goes for `denom`.
```{r}
levels(gss_cat$denom)
```
Ignoring "No answer", "Don't know", and "Other party", the levels of `partyid` are ordered from "Strong Republican" to "Strong Democrat".
```{r}
levels(gss_cat$partyid)
```
</div>
### Exercise 15.4.3 {.unnumbered .exercise data-number="15.4.3"}
<div class="question">
Why did moving “Not applicable” to the front of the levels move it to the bottom of the plot?
</div>
<div class="answer">
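Because that gives the level "Not applicable" an integer value of 1.
ggplot2 positions the levels of a discrete variable along the axis in order of their integer codes, so the level with code 1 is drawn closest to the origin, at the bottom of the y-axis.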
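The effect on the integer codes can be checked directly. This is a minimal sketch with a hypothetical toy factor, `f`, rather than the `rincome` column itself.

```{r}
library("forcats")

# Hypothetical toy factor; "Not applicable" starts out as the last level.
f <- factor(
  c("Low", "High", "Not applicable"),
  levels = c("Low", "High", "Not applicable")
)

# By default fct_relevel() moves the named level to the front.
f_front <- fct_relevel(f, "Not applicable")

levels(f_front)
as.integer(f_front) # the underlying integer code of each value
```

After releveling, the "Not applicable" value is the one carrying integer code 1.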
</div>
## Modifying factor levels {#modifying-factor-levels .r4ds-section}
### Exercise 15.5.1 {.unnumbered .exercise data-number="15.5.1"}
<div class="question">
How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?
</div>
<div class="answer">
To answer that, we need to combine the multiple levels into Democrat, Republican, and Independent.
```{r}
levels(gss_cat$partyid)
```
```{r}
gss_cat %>%
  mutate(
    partyid =
      fct_collapse(partyid,
        other = c("No answer", "Don't know", "Other party"),
        rep = c("Strong republican", "Not str republican"),
        ind = c("Ind,near rep", "Independent", "Ind,near dem"),
        dem = c("Not str democrat", "Strong democrat")
      )
  ) %>%
  count(year, partyid) %>%
  group_by(year) %>%
  mutate(p = n / sum(n)) %>%
  ggplot(aes(
    x = year, y = p,
    colour = fct_reorder2(partyid, year, p)
  )) +
  geom_point() +
  geom_line() +
  labs(colour = "Party ID.")
```
</div>
### Exercise 15.5.2 {.unnumbered .exercise data-number="15.5.2"}
<div class="question">
How could you collapse `rincome` into a small set of categories?
</div>
<div class="answer">
Group all the non-responses into one category, and then group the other categories into a smaller number. Since there is a clear ordering, we would not use `fct_lump()`.
```{r}
levels(gss_cat$rincome)
```
```{r}
library("stringr")
gss_cat %>%
  mutate(
    rincome =
      fct_collapse(
        rincome,
        `Unknown` = c("No answer", "Don't know", "Refused", "Not applicable"),
        `Lt $5000` = c("Lt $1000", str_c(
          "$", c("1000", "3000", "4000"),
          " to ", c("2999", "3999", "4999")
        )),
        `$5000 to 10000` = str_c(
          "$", c("5000", "6000", "7000", "8000"),
          " to ", c("5999", "6999", "7999", "9999")
        )
      )
  ) %>%
  ggplot(aes(x = rincome)) +
  geom_bar() +
  coord_flip()
```
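To see why `fct_lump()` is a poor fit for ordered levels, the sketch below compares it with `fct_collapse()` on a hypothetical toy factor, `income`. The level names and counts are invented for illustration only.

```{r}
library("forcats")

# Hypothetical ordered-looking toy factor: the two middle levels are the
# most frequent, and the two extremes are rare.
income <- factor(
  c("Low", "Mid", "Mid", "Mid", "High", "High", "High", "Top"),
  levels = c("Low", "Mid", "High", "Top")
)

# fct_lump() keeps the n most frequent levels and merges the rest, so the
# lowest and highest incomes end up together in "Other".
lumped <- fct_lump(income, n = 2)
levels(lumped)

# fct_collapse() lets us merge neighbouring levels by hand instead.
collapsed <- fct_collapse(income,
  "Low or mid" = c("Low", "Mid"),
  "High or top" = c("High", "Top")
)
levels(collapsed)
```

Lumping by frequency ignores the ordering of the levels, whereas collapsing adjacent levels by hand keeps it.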
</div>