code.Rmd

---
title: "Handling data frames the data.table way"
author: "S. Rao"
date: "10/06/2020"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval = FALSE)
```

## Summary of the content

- The data.table philosophy
- How data.table differs from base R & how it differs from dplyr
- Run through with example dataset
  - loading data into R
  - subsetting and filtering
  - grouping and summarising
  - sorting
  - joining 
  - reshaping
  - chaining several operations
  

I aim to introduce data.table to those who are somewhat familiar with base R and/or dplyr, with some examples showing similarities and differences. But this is just to anchor you to some familiar syntax that you have already encountered, and it is not necessary to run the base R and dplyr code. (I also do not make any speed comparisons or formal benchmarking; there are several examples of head-to-head benchmarking comparisons between data.table vs dplyr vs base R online and while data.table is arguably much faster, it may not be the same (or may not even matter) in your use case.)

## The data.table package

The data.table package was created by Matt Dowle and co-developed by Arun Srinivasan. The data.table philosophy is 

- speed
- concision - (almost) everything can be done within `[]`
- few dependencies

**DT[i, j, by = ]**

where:

- `i` is row subsetting
- `j` is column selection and any arbitrary R expression that returns a list
- `by =` is grouping variable(s)

Additionally:

- `.N` for number of observations (rows)
- `on =` is joining (merging) variable(s)
- `.()` shortcut for `list()`
- `.SD` and `.SDcols` - subset of dataset (not discussed here)
- `set*` functions - e.g. `setnames()`, `setDT()`, `setorder()` perform actions on the data.table by reference (i.e. in place)

Let's load the packages we are going to need. 

```{r}
library(data.table)
library(dplyr)
library(ggplot2)
library(babynames)
```

## 1. Loading data `fread()`

We will mainly use the data from the babynames package for later work, but intially let's go through data.table's `fread` function.

`fread` can take a url, a filename or the output of a bash command as input. For example, let's read a csv file from a url and write it to a gzipped file locally using `data.table::fwrite`
```{r}
flights <- fread("https://raw.githubusercontent.com/wiki/arunsrinivasan/flights/NYCflights14/flights14.csv")
fwrite(flights, "flights.csv.gz")
```
We can read a compressed file directly (if `R.utils` package is installed)
```{r}
fread("flights.csv.gz")
```

We can read output from arbitrary bash commands. Here we use the cut command in bash to read in only columns 1-8 (the same can be done within the `fread()` function using the `select = ` argument)
```{r}
fread(cmd = "zcat flights.csv.gz | cut -d, -f1-8")
```

Let's make a copy of the babynames dataset, coerce it to a data.table. 
```{r}
class(babynames) # a tibble
babynames_df <- as.data.frame(babynames)
class(babynames_df) # a data.frame
babynames_dt <- as.data.table(babynames)
class(babynames_dt)
head(babynames_dt) # a data.table
```

## 2. Subsetting data `[i, j]`

Remember again:

**DT[i, j, by = ]**

where:

- `i` is row subsetting
- `j` is column selection and any arbitrary R expression that returns a list
- `by =` is grouping variable(s)

Additionally:

- `.N` for number of observations (rows)
- `on =` is joining (merging) variable(s)
- `.()` shortcut for `list()`
- `.SD` and `.SDcols` - subset of dataset (not discussed here)
- `set*` functions - e.g. `setnames()`, `setDT()`, `setorder()` perform actions on the data.table by reference (i.e. in place)


Let's try making a new data.table for only the year 2000. (Important to remember that if there is only one argument within `[]`, base R interprets this as the _column_ (because data.frames are lists internally), whereas data.table interprets this as the _row_.)
```{r, eval=FALSE}
# base R
babynames_df[babynames_df$year == 1880, ] # try this: babynames_df[babynames_df$year == 1880]
# data.table
babynames_dt[year == 1880] # equivalent to babynames_dt[year == 2000, ]
# dplyr
babynames %>%
  filter(year == 1880) # equivalent to filter(babynames, year == 2000)
```

In a data.table `[]`

- the first argument `i` is _always_ row (not so with base R!)
- can refer to column or variable names directly (without quotes or referring to the data.frame, similar to dplyr)
- if expression in `j` outputs a `list`, the subset is a `data.table`, if not, it is an atomic vector
- `j` can be coerced to a list to make sure the subset is a `data.table` using `list()` or `.()`

Next, let's subset the dataset to give us just the `name` column for the year 2000.
```{r, eval=FALSE}
# base R
babynames_df[babynames_df$year == 2000, "name"] # equivalent to babynames_df[babynames_df$year == 2000, 3]
# data.table
babynames_dt[year == 2000, name] # NOT the same as babynames_dt[year == 2000, 3]
babynames_dt[year == 2000, list(name)] # the same as babynames_dt[year == 2000, 3]
babynames_dt[year == 2000, .(name)] # .() is just a shorthand for list()
# dplyr
babynames %>%
  filter(year == 2000) %>%
  select(name)
```

**Introducing .N**

- stands for the number of rows in the current grouping (or subset)
- similar to `count()` function in dplyr

<p style="color:blue">
Q. How many different names were given to babies born in the year 2000?
</p>
```{r}
# base R
nrow(babynames_df[babynames_df$year == 2000, ]) # returns an atomic vector
# data.table
babynames_dt[year == 2000, .N] # returns an atomic vector here, but a data.table is returned if there is more than 1 group
babynames_dt[year == 2000, .(.N)] # or wrap j in .() to consistently return a data.table
# dplyr 
babynames %>%
  filter(year == 2000) %>%
  count() # returns a tibble
```

<p style="color:blue">
Q. How many babies were named "Gandalf", in all the years that data has been collected?
</p>
```{r}
# babynames_dt[i, j]

```

## detour - adding new columns

**By copying**

Just like base R, and similar to `mutate()` in dplyr.

```{r}
babynames_dt$firstletter <- substr(babynames_dt$name, start = 1, stop = 1)
```

**By reference `:=`**

Changes the data.table _in place_. 
```{r}
babynames_dt[ , firstletter := substr(name, start = 1, stop = 1)]
head(babynames_dt)
```

## 3. Grouping `by =`

The `by = ` argument takes the grouping variable. For example, if we want to group by the first letter

<p style="color:blue">
Q. How many baby names start with each letter of the alphabet?
</p>
```{r}
counts_by_letter <- babynames_dt[ , .(total = sum(n)), by = firstletter]
counts_by_letter
```

<p style="color:blue">
Q. How many babies were named "Trinity" in *each of the years* between 1995 and 2004 (inclusive)?
hint: subset by name and year range in `i`, select year and n columns in `j`
</p>
```{r}
babynames_dt[name == "Trinity" & year > 1994 & year < 2006 , .(total = sum(n)), by = year] %>%
  ggplot(aes(x = year, y = total)) +
    geom_line()
```

## 4. Sorting

**By copying** - using `order()`, which is a base R function that has been optimised in data.table

```{r}
counts_by_letter[order(firstletter)]  # A to Z
counts_by_letter[order(-firstletter)] # Z to A
counts_by_letter[order(total)]   # ascending total
counts_by_letter[order(-total)]  # descending total
```

**By reference**
```{r}
setorder(counts_by_letter, -firstletter)
counts_by_letter
```

## 5. Joining using the `on = ` keyword

```{r}
letters_df <- data.table(firstletter = LETTERS, letter_no = 1:26)
letters_df
counts_by_letter[letters_df, on = .(firstletter)]
```

## 6. Reshaping data.tables `melt` and `dcast`

Making long tables wide using `dcast` 

formula: LHS ~ RHS

- `.` represents **no variable**
- `...` represents **all other variables not explicitly mentioned in LHS or RHS**

Let's first make a new data.table that summarises the total number of individuals by the first letter of their name and the year of birth:
```{r}
counts_by_letter_and_year <- babynames_dt[, .(total = sum(n)), by = .(firstletter, year)]
counts_by_letter_and_year
```

Let's visualise that, for fun.
```{r}
counts_by_letter_and_year %>%
  ggplot(aes(x = year, y = total, colour = firstletter)) +
    geom_line()
```

Using `dcast()` to make a wide table (year in rows, each letter as a separate column)
```{r}
widetable <- dcast(counts_by_letter_and_year, year ~ firstletter) 
widetable
```

How popular are Z names over the years?
```{r}
widetable %>%
  ggplot(aes(x = year, y = Z)) + 
    geom_line()
```

Making wide tables long using `melt` 

- `id.vars` - the columns that are 'kept as is'
- `measure.vars` - columns that are 'melted'
- `variable.name` - name for the names of the melted columns
- `value.name` - name for the values from the melted columns

```{r}
widetable
melt(widetable, id.vars = c("year"), variable.name = "letter", value.name = "total")
```

## 7. Bringing it all together

<p style="color:blue">
Q. Which palindromic names are the most common (in all years combined)?
</p>
```{r}
# function to check if a string is a palindrome
# returns TRUE if the reverse of a name is the same as the name itself
# ignores capitalisation
is_palindrome <- function(name) {
  rev_name = sapply(strsplit(name, ""), function(x) paste(rev(x), collapse = ""))
  return(tolower(name) == tolower(rev_name))
}

# is_palindrome("lol")
```

Using data.table:
```{r}
# in plain English: take the babynames dataset, subset it for palindromic names, and sum the number of individuals for each name, and sort by the total number
babynames_dt[is_palindrome(name), .(total = sum(n)), by = name][order(-total), ]
```


Equivalent code with dplyr
```{r}
babynames_dt %>%
  filter(is_palindrome(name)) %>%
  group_by(name) %>%
  summarise(total = sum(n)) %>%
  arrange(desc(total))
```

Equivalent code with base R
```{r}
babynames_palindromic <- babynames_df[is_palindrome(babynames_df$name), ]
babynames_aggregated <- aggregate(babynames_palindromic$n, by = list(babynames_palindromic$name), FUN = "sum")
babynames_aggregated[order(babynames_aggregated$x, decreasing = TRUE), ]
```

Bonus question:
<p style="color:blue">
Q. What are the top 10 most common names among millennials (let's go with the Wikipedia definition of millennials = those born between the birth years 1981 and 1996 (inclusive))?
</p>

## Resources

- data.table vignettes! https://cran.r-project.org/web/packages/data.table/vignettes/ 
- Cheatsheet - https://github.com/rstudio/cheatsheets/raw/master/datatable.pdf
- Github repo - https://github.com/Rdatatable/data.table/wiki/Articles