---
title: "Pipes with dplyr"
subtitle: "Stat 133"
author: "Gaston Sanchez"
output: github_document
fontsize: 11pt
urlcolor: blue
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, error = TRUE, fig.path = '05-images/')
library(knitr)
library(dplyr)
library(ggplot2)
library(magrittr)
```
> ### Learning Objectives:
>
> - Compare base R and `"dplyr"`
> - Get to know the pipe operator `%>%`
------
## Introduction
Last week you started to manipulate data tables (e.g. `data.frame`, `tibble`)
with functions provided by the R package `"dplyr"`.
Having been exposed to the _dplyr_ paradigm, let's compare base R manipulation with the various dplyr syntax flavors.
### Starwars Data Set
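In this tutorial we are going to use the data set `starwars` that comes with `"dplyr"`. Note that the examples assume a version of `"dplyr"` older than 1.0.0, in which the column `gender` takes values such as `female` and `male`; from 1.0.0 on, those values live in the column `sex`, and `gender` holds `feminine` and `masculine`. Here's the data set: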
```{r warning = FALSE, message = FALSE}
# load dplyr
library(dplyr)
# data set
starwars
```
### Average Height of Male and Female Individuals
For illustration purposes, let's consider a relatively simple example.
Say we are interested in calculating the average (mean) height for both female
and male individuals. Let's discuss how to find the solution under the base R
approach, as well as the dplyr approach.
-----
## Quick inspection of `height`
```{r}
# summary stats of height
summary(starwars$height)
```
```{r height_histogram}
# histogram
hist(starwars$height, col = 'gray80', las = 1)
```
## Quick inspection of `gender`
```{r}
# frequencies of gender
summary(starwars$gender)
gender_freqs <- table(starwars$gender)
gender_freqs
```
```{r gender_barchart}
# barchart of gender freqs
barplot(gender_freqs, border = NA, las = 1)
```
Now let's use `"dplyr"` to get the frequencies:
```{r}
# distinct values
distinct(starwars, gender)
```
Oh! Notice that we have some missing values (`NA`), which were not reported by `table()` because it drops missing values by default.
```{r}
# frequencies of gender (via dplyr)
count(starwars, gender)
```
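If you want `table()` to report missing values too, one option is its `useNA` argument. Here is a minimal sketch on a small made-up vector (`toy_gender` is hypothetical, not part of `starwars`):

```{r}
# toy vector with one missing value (made up for illustration)
toy_gender <- c('female', 'male', NA, 'male')

# default behavior: NA is silently dropped
table(toy_gender)

# useNA = 'ifany' adds a count for NA when missing values are present
table(toy_gender, useNA = 'ifany')
```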
-----
## Base R approach
Let's see how to use base R operations to find the average `height` of individuals with `gender` female and male.
```{r}
# identify female and male individuals
# (comparison operations)
which_females <- starwars$gender == 'female'
which_males <- starwars$gender == 'male'
```
```{r}
# select the height values of females and males
# (via logical subsetting)
height_females <- starwars$height[which_females]
height_males <- starwars$height[which_males]
```
```{r}
# calculate averages (removing missing values)
avg_ht_female <- mean(height_females, na.rm = TRUE)
avg_ht_male <- mean(height_males, na.rm = TRUE)
# optional: display averages in a vector
c('female' = avg_ht_female, 'male' = avg_ht_male)
```
All the previous code can be written with more compact expressions:
```{r}
# all calculations in a couple of lines of code
c("female" = mean(starwars$height[starwars$gender == 'female'], na.rm = TRUE),
  "male" = mean(starwars$height[starwars$gender == 'male'], na.rm = TRUE)
)
```
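Base R also has grouped-summary helpers that avoid repeating the subsetting for each group. As one possible alternative (not the approach used above), `tapply()` applies a function to a vector split by a grouping variable. The sketch below uses a small made-up data frame, `toy`, so the values are hypothetical:

```{r}
# small made-up data frame (values are hypothetical)
toy <- data.frame(
  gender = c('female', 'male', 'female', 'male', NA),
  height = c(150, 180, 170, NA, 200)
)

# mean height per gender; extra arguments (na.rm) are passed on to mean()
# rows whose gender is NA are left out of the grouping
toy_means <- tapply(toy$height, toy$gender, mean, na.rm = TRUE)
toy_means
```

Compared with logical subsetting, this computes every group in one call, at the cost of returning a named array rather than separate objects.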
-----
## With `"dplyr"`
The behavior of `"dplyr"` is functional in the sense that function calls don't
have side-effects. You must always save their results in order to keep them
in an object (in memory). This doesn't lead to particularly elegant code,
especially if you want to do many operations at once.
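To make the point about side-effects concrete, here is a minimal sketch with a made-up data frame (`toy_df` and `tall` are hypothetical names). Calling `filter()` returns a new table and leaves the input untouched, so the result is lost unless you assign it:

```{r}
library(dplyr)

# small made-up data frame (values are hypothetical)
toy_df <- data.frame(
  name = c('a', 'b', 'c'),
  height = c(150, 180, 170)
)

# the result is printed, but not stored anywhere
filter(toy_df, height > 160)

# toy_df itself is unchanged
nrow(toy_df)

# to keep the result, assign it to an object
tall <- filter(toy_df, height > 160)
nrow(tall)
```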
### Option 1) Step-by-step
You either have to do it step-by-step:
```{r}
# manipulation step-by-step
gender_height <- select(starwars, gender, height)
fem_male_height <- filter(gender_height,
                          gender == 'female' | gender == 'male')
height_by_gender <- group_by(fem_male_height, gender)
summarise(height_by_gender, mean(height, na.rm = TRUE))
```
### Option 2) Nested (embedded) code
Or if you don't want to name the intermediate results, you need to wrap the
function calls inside each other:
```{r}
summarise(
  group_by(
    filter(select(starwars, gender, height),
           gender == 'female' | gender == 'male'),
    gender),
  mean(height, na.rm = TRUE)
)
```
This is difficult to read because the operations run from the inside out: the
first step, `select()`, is buried in the middle, and the arguments of each
function end up a long way from the function name.
### Option 3) Piping
To get around the problem of nesting functions, `"dplyr"` also provides the
`%>%` operator from the R package `"magrittr"`.
What does the _piper_ `%>%` do? Here's a conceptual example:
```{r eval = FALSE}
x %>% f(y)
```
`x %>% f(y)` turns into `f(x, y)`: the object on the left becomes the first
argument of the function on the right. This lets you rewrite a sequence of
operations so that it reads left-to-right, top-to-bottom.
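As a quick check of this equivalence, the sketch below compares piped calls against their ordinary forms, using a made-up numeric vector (`x` and its values are hypothetical):

```{r}
library(magrittr)

# made-up numeric vector with a missing value
x <- c(10, 20, NA, 40)

# piped form: x becomes the first argument of mean()
piped <- x %>% mean(na.rm = TRUE)

# ordinary form
nested <- mean(x, na.rm = TRUE)

# both give the same result
identical(piped, nested)

# a chain of two steps reads left-to-right ...
chained <- x %>% sort() %>% head(2)

# ... while the nested form reads from the inside out
identical(chained, head(sort(x), 2))
```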
Here's how to use the piper to calculate the average height for female and
male individuals:
```{r}
avg_height_by_gender <- starwars %>%
  select(gender, height) %>%
  filter(gender == 'female' | gender == 'male') %>%
  group_by(gender) %>%
  summarise(avg = mean(height, na.rm = TRUE))
avg_height_by_gender
avg_height_by_gender$avg
```
-----
## Another Example
Here's another example in which we calculate the mean `height` and mean `mass` of `species` Droid, Ewok, and Human; arranging the rows of the tibble by mean height, in descending order:
```{r}
starwars %>%
  select(species, height, mass) %>%
  filter(species %in% c('Droid', 'Ewok', 'Human')) %>%
  group_by(species) %>%
  summarise(
    mean_height = mean(height, na.rm = TRUE),
    mean_mass = mean(mass, na.rm = TRUE)
  ) %>%
  arrange(desc(mean_height))
```
-----
## Pipes and Plots
You can also use the `%>%` operator to chain dplyr commands with ggplot commands (and other R commands). The following example combines some data manipulation, using `filter()` to keep female and male individuals, with a density plot of `height`:
```{r densities}
starwars %>%
  filter(gender %in% c('female', 'male')) %>%
  ggplot(aes(x = height, fill = gender)) +
  geom_density(alpha = 0.7)
```
Here's another example in which instead of graphing density plots, we graph boxplots of `height` for female and male individuals:
```{r boxplots}
starwars %>%
  filter(gender %in% c('female', 'male')) %>%
  ggplot(aes(x = gender, y = height, fill = gender)) +
  geom_boxplot()
```
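One detail worth noticing in these chunks: `%>%` passes the filtered table into `ggplot()`, but the plot layers are still added with `+`, not with `%>%`. Here is a minimal sketch with a made-up data frame (`toy_plot_data`, `kept`, and `toy_boxplot` are hypothetical names) that keeps the two steps separate:

```{r}
library(dplyr)
library(ggplot2)

# small made-up data frame (values are hypothetical)
toy_plot_data <- data.frame(
  gender = c('female', 'male', 'female', 'male', 'none'),
  height = c(150, 180, 170, 190, 100)
)

# data manipulation step: chained with %>%
kept <- toy_plot_data %>%
  filter(gender %in% c('female', 'male'))

# plotting step: the table is piped into ggplot(), layers are added with +
toy_boxplot <- kept %>%
  ggplot(aes(x = gender, y = height, fill = gender)) +
  geom_boxplot()

toy_boxplot
```

Storing the filtered table in its own object is optional, but it makes it easy to inspect the data that actually reaches the plot.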
-----
## More Pipes
Often, you will work with functions that don't take data frames (or tibbles) as
inputs. A typical example is the base `plot()` function used to produce a
scatterplot of two variables, for which you typically pass vectors rather than
a data frame. In these situations you might find the `%$%` operator extremely
useful.
```{r eval = FALSE}
library(magrittr)
```
The `%$%` operator, also from the package `"magrittr"`, is a cousin of the
`%>%` operator. What `%$%` does is _expose_ the variables of a data frame
so that you can refer to them by name. Let's see a quick example:
```{r scatterplot}
starwars %>%
  filter(gender %in% c('female', 'male')) %$%
  plot(x = height, y = mass, col = factor(gender), las = 1)
```
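Roughly speaking, `lhs %$% expr` evaluates `expr` with the columns of `lhs` available by name, much like base R's `with()`. Here is a minimal sketch on a made-up data frame (`toy_hm` and its values are hypothetical):

```{r}
library(magrittr)

# small made-up data frame (values are hypothetical)
toy_hm <- data.frame(
  height = c(150, 160, 170, 180),
  mass = c(50, 58, 69, 80)
)

# cor() takes vectors, not a data frame; %$% exposes the columns by name
cor_exposed <- toy_hm %$% cor(height, mass)

# comparable base R formulation
cor_with <- with(toy_hm, cor(height, mass))

all.equal(cor_exposed, cor_with)
```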