-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathcode.Rmd
332 lines (258 loc) · 10.3 KB
/
code.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
---
title: "Handling data frames the data.table way"
author: "S. Rao"
date: "10/06/2020"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval = FALSE)
```
## Summary of the content
- The data.table philosophy
- How data.table differs from base R & how it differs from dplyr
- Run through with example dataset
- loading data into R
- subsetting and filtering
- grouping and summarising
- sorting
- joining
- reshaping
- chaining several operations
I aim to introduce data.table to those who are somewhat familiar with base R and/or dplyr, with some examples showing similarities and differences. But this is just to anchor you to some familiar syntax that you have already encountered, and it is not necessary to run the base R and dplyr code. (I also do not make any speed comparisons or formal benchmarking; there are several examples of head-to-head benchmarking comparisons between data.table vs dplyr vs base R online and while data.table is arguably much faster, it may not be the same (or may not even matter) in your use case.)
## The data.table package
The data.table package was created by Matt Dowle and co-developed by Arun Srinivasan. The data.table philosophy is
- speed
- concision - (almost) everything can be done within `[]`
- few dependencies
**DT[i, j, by = ]**
where:
- `i` is row subsetting
- `j` is column selection and any arbitrary R expression that returns a list
- `by =` is grouping variable(s)
Additionally:
- `.N` for number of observations (rows)
- `on =` is joining (merging) variable(s)
- `.()` shortcut for `list()`
- `.SD` and `.SDcols` - subset of dataset (not discussed here)
- `set*` functions - e.g. `setnames()`, `setDT()`, `setorder()` perform actions on the data.table by reference (i.e. in place)
Let's load the packages we are going to need.
```{r}
library(data.table)
library(dplyr)
library(ggplot2)
library(babynames)
```
## 1. Loading data `fread()`
We will mainly use the data from the babynames package for later work, but intially let's go through data.table's `fread` function.
`fread` can take a url, a filename or the output of a bash command as input. For example, let's read a csv file from a url and write it to a gzipped file locally using `data.table::fwrite`
```{r}
flights <- fread("https://raw.githubusercontent.com/wiki/arunsrinivasan/flights/NYCflights14/flights14.csv")
fwrite(flights, "flights.csv.gz")
```
We can read a compressed file directly (if `R.utils` package is installed)
```{r}
fread("flights.csv.gz")
```
We can read output from arbitrary bash commands. Here we use the cut command in bash to read in only columns 1-8 (the same can be done within the `fread()` function using the `select = ` argument)
```{r}
fread(cmd = "zcat flights.csv.gz | cut -d, -f1-8")
```
Let's make a copy of the babynames dataset, coerce it to a data.table.
```{r}
class(babynames) # a tibble
babynames_df <- as.data.frame(babynames)
class(babynames_df) # a data.frame
babynames_dt <- as.data.table(babynames)
class(babynames_dt)
head(babynames_dt) # a data.table
```
## 2. Subsetting data `[i, j]`
Remember again:
**DT[i, j, by = ]**
where:
- `i` is row subsetting
- `j` is column selection and any arbitrary R expression that returns a list
- `by =` is grouping variable(s)
Additionally:
- `.N` for number of observations (rows)
- `on =` is joining (merging) variable(s)
- `.()` shortcut for `list()`
- `.SD` and `.SDcols` - subset of dataset (not discussed here)
- `set*` functions - e.g. `setnames()`, `setDT()`, `setorder()` perform actions on the data.table by reference (i.e. in place)
Let's try making a new data.table for only the year 2000. (Important to remember that if there is only one argument within `[]`, base R interprets this as the _column_ (because data.frames are lists internally), whereas data.table interprets this as the _row_.)
```{r, eval=FALSE}
# base R
babynames_df[babynames_df$year == 1880, ] # try this: babynames_df[babynames_df$year == 1880]
# data.table
babynames_dt[year == 1880] # equivalent to babynames_dt[year == 2000, ]
# dplyr
babynames %>%
filter(year == 1880) # equivalent to filter(babynames, year == 2000)
```
In a data.table `[]`
- the first argument `i` is _always_ row (not so with base R!)
- can refer to column or variable names directly (without quotes or referring to the data.frame, similar to dplyr)
- if expression in `j` outputs a `list`, the subset is a `data.table`, if not, it is an atomic vector
- `j` can be coerced to a list to make sure the subset is a `data.table` using `list()` or `.()`
Next, let's subset the dataset to give us just the `name` column for the year 2000.
```{r, eval=FALSE}
# base R
babynames_df[babynames_df$year == 2000, "name"] # equivalent to babynames_df[babynames_df$year == 2000, 3]
# data.table
babynames_dt[year == 2000, name] # NOT the same as babynames_dt[year == 2000, 3]
babynames_dt[year == 2000, list(name)] # the same as babynames_dt[year == 2000, 3]
babynames_dt[year == 2000, .(name)] # .() is just a shorthand for list()
# dplyr
babynames %>%
filter(year == 2000) %>%
select(name)
```
**Introducing .N**
- stands for the number of rows in the current grouping (or subset)
- similar to `count()` function in dplyr
<p style="color:blue">
Q. How many different names were given to babies born in the year 2000?
</p>
```{r}
# base R
nrow(babynames_df[babynames_df$year == 2000, ]) # returns an atomic vector
# data.table
babynames_dt[year == 2000, .N] # returns an atomic vector here, but a data.table is returned if there is more than 1 group
babynames_dt[year == 2000, .(.N)] # or wrap j in .() to consistently return a data.table
# dplyr
babynames %>%
filter(year == 2000) %>%
count() # returns a tibble
```
<p style="color:blue">
Q. How many babies were named "Gandalf", in all the years that data has been collected?
</p>
```{r}
# babynames_dt[i, j]
```
## detour - adding new columns
**By copying**
Just like base R, and similar to `mutate()` in dplyr.
```{r}
babynames_dt$firstletter <- substr(babynames_dt$name, start = 1, stop = 1)
```
**By reference `:=`**
Changes the data.table _in place_.
```{r}
babynames_dt[ , firstletter := substr(name, start = 1, stop = 1)]
head(babynames_dt)
```
## 3. Grouping `by =`
The `by = ` argument takes the grouping variable. For example, if we want to group by the first letter
<p style="color:blue">
Q. How many baby names start with each letter of the alphabet?
</p>
```{r}
counts_by_letter <- babynames_dt[ , .(total = sum(n)), by = firstletter]
counts_by_letter
```
<p style="color:blue">
Q. How many babies were named "Trinity" in *each of the years* between 1995 and 2004 (inclusive)?
hint: subset by name and year range in `i`, select year and n columns in `j`
</p>
```{r}
babynames_dt[name == "Trinity" & year > 1994 & year < 2006 , .(total = sum(n)), by = year] %>%
ggplot(aes(x = year, y = total)) +
geom_line()
```
## 4. Sorting
**By copying** - using `order()`, which is a base R function that has been optimised in data.table
```{r}
counts_by_letter[order(firstletter)] # A to Z
counts_by_letter[order(-firstletter)] # Z to A
counts_by_letter[order(total)] # ascending total
counts_by_letter[order(-total)] # descending total
```
**By reference**
```{r}
setorder(counts_by_letter, -firstletter)
counts_by_letter
```
## 5. Joining using the `on = ` keyword
```{r}
letters_df <- data.table(firstletter = LETTERS, letter_no = 1:26)
letters_df
counts_by_letter[letters_df, on = .(firstletter)]
```
## 6. Reshaping data.tables `melt` and `dcast`
Making long tables wide using `dcast`
formula: LHS ~ RHS
- `.` represents **no variable**
- `...` represents **all other variables not explicitly mentioned in LHS or RHS**
Let's first make a new data.table that summarises the total number of individuals by the first letter of their name and the year of birth:
```{r}
counts_by_letter_and_year <- babynames_dt[, .(total = sum(n)), by = .(firstletter, year)]
counts_by_letter_and_year
```
Let's visualise that, for fun.
```{r}
counts_by_letter_and_year %>%
ggplot(aes(x = year, y = total, colour = firstletter)) +
geom_line()
```
Using `dcast()` to make a wide table (year in rows, each letter as a separate column)
```{r}
widetable <- dcast(counts_by_letter_and_year, year ~ firstletter)
widetable
```
How popular are Z names over the years?
```{r}
widetable %>%
ggplot(aes(x = year, y = Z)) +
geom_line()
```
Making wide tables long using `melt`
- `id.vars` - the columns that are 'kept as is'
- `measure.vars` - columns that are 'melted'
- `variable.name` - name for the names of the melted columns
- `value.name` - name for the values from the melted columns
```{r}
widetable
melt(widetable, id.vars = c("year"), variable.name = "letter", value.name = "total")
```
## 7. Bringing it all together
<p style="color:blue">
Q. Which palindromic names are the most common (in all years combined)?
</p>
```{r}
# function to check if a string is a palindrome
# returns TRUE if the reverse of a name is the same as the name itself
# ignores capitalisation
is_palindrome <- function(name) {
rev_name = sapply(strsplit(name, ""), function(x) paste(rev(x), collapse = ""))
return(tolower(name) == tolower(rev_name))
}
# is_palindrome("lol")
```
Using data.table:
```{r}
# in plain English: take the babynames dataset, subset it for palindromic names, and sum the number of individuals for each name, and sort by the total number
babynames_dt[is_palindrome(name), .(total = sum(n)), by = name][order(-total), ]
```
Equivalent code with dplyr
```{r}
babynames_dt %>%
filter(is_palindrome(name)) %>%
group_by(name) %>%
summarise(total = sum(n)) %>%
arrange(desc(total))
```
Equivalent code with base R
```{r}
babynames_palindromic <- babynames_df[is_palindrome(babynames_df$name), ]
babynames_aggregated <- aggregate(babynames_palindromic$n, by = list(babynames_palindromic$name), FUN = "sum")
babynames_aggregated[order(babynames_aggregated$x, decreasing = TRUE), ]
```
Bonus question:
<p style="color:blue">
Q. What are the top 10 most common names among millennials (let's go with the Wikipedia definition of millennials = those born between the birth years 1981 and 1996 (inclusive))?
</p>
## Resources
- data.table vignettes! https://cran.r-project.org/web/packages/data.table/vignettes/
- Cheatsheet - https://github.com/rstudio/cheatsheets/raw/master/datatable.pdf
- Github repo - https://github.com/Rdatatable/data.table/wiki/Articles