-
Notifications
You must be signed in to change notification settings - Fork 0
/
summarize.qmd
164 lines (116 loc) · 8.08 KB
/
summarize.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
---
title: "Summarize"
---
## Module Learning Objectives
By the end of this module, you will be able to:
- <u>Describe</u> the purpose of the pipe operator (`%>%`)
- <u>Use</u> the pipe operator (`%>%`) to chain multiple functions together
- <u>Summarize</u> data by using `dplyr`'s `group_by` and `summarize` functions
```{r libraries-summarize, echo = F, message = F}
library(tidyverse); library(palmerpenguins)
```
## Pipe Operator (`%>%`)
Before diving into the Tidyverse functions that allow for summarization and group-wise operations, let's talk about the pipe operator (`%>%`). The pipe is from the `magrittr` package and allows chaining together multiple functions without needing to create separate objects at each step as you would have to without the pipe.
### `%>%` Example: Using the Pipe
:::callout-note
## Example
As in the other chapters, let's use the "penguins" data object found in the `palmerpenguins` package. Let's say we want to keep only specimens that have a measurement for both bill length and bill depth and then remove the flipper and body mass columns.
Without the pipe--but still using other Tidyverse functions--we could go about this like this:
```{r without-pipe, message = F}
# Filter out the NAs
penguins_v2 <- dplyr::filter(.data = penguins,
!is.na(bill_length_mm) & !is.na(bill_depth_mm))
# Now strip away the columns we don't want
penguins_v3 <- dplyr::select(.data = penguins_v2,
-flipper_length_mm, -body_mass_g)
# And we can look at our final product with `base::head`
dplyr::glimpse(penguins_v3)
```
Using the pipe though we can simplify this code dramatically! Note that each of the following lines must end with the `%>%` so that R knows there are more lines to consider.
```{r with-pipe, message = F}
# We begin with the name of the data object
penguins %>%
# Then we can filter the data
dplyr::filter(!is.na(bill_length_mm) & !is.na(bill_depth_mm)) %>%
# And strip away the columns we don't want
dplyr::select(-flipper_length_mm, -body_mass_g) %>%
# And we can even include the `glimpse` function to see our progress
dplyr::glimpse()
```
Note that using the pipe allows each line to inherit the data created by the previous line.
:::
### Challenge: `%>%`
:::callout-important
## Your Turn!
Using pipes, `filter` the data to only include male penguins, `select` only the columns for species, island, and body mass, and `filter` out any rows with NA for body mass.
:::
### Aside: Fun History of Why `%>%` is a "Pipe"
<img src="images/logos/hex_magrittr.png" alt="Hex logo for the 'magrittr' R package" align="right" width="15%"/>
The Belgian painter René Magritte famously created a painting titled "[The Treachery of Images](https://collections.lacma.org/node/239578)" featuring a depiction of a smoking pipe above the words "*Cest ci n'est pas une pipe*" (French for "This is not a pipe"). Magritte's point was about how the depiction of a thing is not equal to thing itself. The `magrittr` package takes its name from the painter because it also includes a pipe that functions slightly differently from a command line pipe and uses different characters. Just like Magritte's pipe, `%>%` both is and isn't a pipe!
## Group-Wise Summarizing
Now that we've covered the `%>%` operator we can use it to do group-wise summarization! Technically this summarization does not *require* the pipe but it does inherently have two steps and thus benefits from using the pipe to chain together those technically separate instructions.
To summarize by groups we first define our groups using `dplyr`'s `group_by` function and then summarize using `summarize` (also from `dplyr`). `summarize` does require you to specify what calculations you want to perform within your groups though it uses similar syntax to `dplyr`'s `mutate` function.
<p align="center">
<img src="images/summarize-group-by.png" alt="Graphic of a table with an 'A' and 'B' column where the 'A' column contains one of two shapes becoming a smaller table with one row per type of shape and an 'A' and 'C' column" width="50%" />
</p>
Despite the similarity in syntax between `summarize` and `mutate` there are a few crucial differences:
- `summarize` returns only a single row per group while `mutate` returns as many rows as are in the original dataframe
- `summarize` will automatically remove any columns that aren't either (1) included in `group_by` or (2) created by `summarize`. `mutate` cannot remove columns so it only creates whatever you tell it to.
### `group_by` + `summarize` Example: Summarize within Groups
:::callout-note
## Example
By using the `%>%` with `group_by` and `summarize`, we can calculate some summarized metric within our specified groups. To begin, let's find the average bill depth within each species of penguin.
```{r group_by-summarize-basic, message = F}
# Begin with the data and a pipe
penguins %>%
# Group by the desired column names
dplyr::group_by(species) %>%
# And summarize in the way we desire
dplyr::summarize(mean_bill_dep_mm = mean(bill_depth_mm, na.rm = TRUE) )
```
Notice how the resulting dataframe only contains one row per value in the `group_by` call and only includes the grouping column and the column we created (`mean_bill_dep_mm`)? This reduction in dimensions is an inherent property of `summarize` and can be intensely valuable but be careful you don't accidentally remove columns that you want!
:::
### `group_by` + `summarize` Example: Calculate Multiple Metrics
:::callout-note
## Example
Let's say we want to find *multiple* summary values for body mass of each species of penguin on each island. To accomplish this we can do the following:
```{r group_by-summarize-complex, message = F}
# Begin with the data and a pipe
penguins %>%
# Group by the desired column names
dplyr::group_by(species, island) %>%
# And summarize in the way we desire
dplyr::summarize(
# Get average body mass
mean_mass_g = mean(body_mass_g, na.rm = TRUE),
# Get the standard deviation
sd_mass = sd(body_mass_g, na.rm = TRUE),
# Count the number of individual penguins of each species at each island
n_mass = dplyr::n(),
# Calculate standard error from SD divided by count
se_mass = sd_mass / sqrt(n_mass) )
```
You can see that we also invoked the `n` function from `dplyr` to return the size of each group. This function reads any groups created by `group_by` and returns the count of rows in the dataframe for each group level.
Just like `mutate`, `summarize` will allow you to create as many columns as you want. So, if you want metrics calculated within your groups, you only need to define each of them within the `summarize` function.
:::
### Challenge: `summarize`
:::callout-important
## Your Turn!
Using what we've covered so far, find the average flipper length in each year (regardless of any other grouping variable).
:::
## Grouping Cautionary Note
`group_by` can be extremely useful in summarizing a dataframe or creating a new column without losing rows but you need to be careful. Objects created with `group_by` "remember" their groups until you change the groups or use the function `ungroup` from `dplyr`.
Look at how the output of a grouped data object tells you the number of groups in the output (see beneath this code chunk).
```{r group_by-grouping, message = F}
penguins %>%
dplyr::group_by(species, island) %>%
dplyr::summarize(penguins_count = dplyr::n())
```
This means that all future uses of that pipe will continue to use the grouping established to create the "penguins_count" column. We can stop this by doing the same pipe, but adding `ungroup` after we're done using the grouping established by `group_by`.
```{r ungroup, message = F}
penguins %>%
dplyr::group_by(species, island) %>%
dplyr::summarize(penguins_count = dplyr::n()) %>%
dplyr::ungroup()
```
See? We calculated with our desired groups but then dropped the grouping structure once we were finished with them. Note also that if you use `group_by` and do some calculation then re-group by something else by using `group_by` *again*, the second use of `group_by` **will not be affected** by the first. This means that you only need one `ungroup` per pipe.