# Data Wrangling {#edav-wrangling}
```{r edav-wrangling-setup}
#| results: asis
#| echo: false
source("_common.R")
status("complete")
```
## What is Data Wrangling?
::: medium_right
<img style = "border-radius: 10%;" src="images/301-edav-wrangling/horst-wrangle-monsters.jpg" alt="Two happy looking round fuzzy monsters, each holding a similarly shaped wrench with the word “wrangle” on it. Between their tools is held up a rectangular data table labelled “TIDY.”">
:::
Okay, so you've got some data. That's a great start!
You might have had it handed to you by a collaborator, [requested it via an API](#data-apis) or [scraped it from the raw html of a webpage](#data-webscraping). In the worst case scenario, you're an _actual_ scientist (not just a _data_ one) and you spent the last several months of your life painstakingly measuring flower petals or car parts. Now we really want to do something useful with that data.
We've already seen how you can load data into R and pivot between wider and longer formats, but that probably isn't enough to satisfy your curiosity. You want to be able to view your data, manipulate and subset it, create new variables from existing ones and cross-reference your dataset with others. All of these things are possible in R and go by various collective names, including data manipulation, data munging and data wrangling.
I've decided to use the term data wrangling here. That's because data manipulation sounds boring as heck and data munging is both unpleasant to say and makes me imagine we are squelching through some sort of information swamp.
In what follows, I'll give a fly-by tour of tools for data wrangling in R, showing some examples along the way. I'll focus on some of the most common and useful operations and link out to more extensive guides for wrangling your data in R, which you can refer back to as you need them.
## Example Data Sets
:::: {.columns}
:::{.column width="40%"}
```{r palmer-penguins-logo}
#| echo: false
#| eval: true
#| out-width: 60%
#| fig-align: center
#| fig-alt: "Palmer penguins R package hex sticker. Three cartoon penguins on a light blue hexagonal logo."
knitr::include_graphics("images/301-edav-wrangling/palmer-penguins.png")
```
:::
::: {.column width="60%"}
To demonstrate some standard wrangling skills, we will use two data sets. The `mtcars` data comes built into every R installation. The second is the `penguins` data from the `{palmerpenguins}` package.
:::
```{r edav-example-datasets}
#| echo: true
#| eval: true
library(palmerpenguins)
penguins <- palmerpenguins::penguins
cars <- datasets::mtcars
```
::::
## Viewing Your Data
### `View()`
The `View()` function can be used to create a spreadsheet-like view of your data. In RStudio this will open as a new tab.
`View()` will work for any "matrix-like" R object, such as a tibble, data frame, vector or matrix. Note the capital letter - the function is called `View()`, not `view()`.
```{r view-palmer-penguins}
#| eval: false
View(penguins)
```
```{r palmer-penguins-screenshot}
#| echo: false
#| eval: true
#| out-width: 90%
#| fig-align: center
#| fig-alt: "Screenshot of RStudio files pane, containing a spreadsheet view of the palmer penguins data set."
knitr::include_graphics("images/301-edav-wrangling/view-penguins-screenshot.png")
```
### `head()`
For large data sets, you might not want to (or be able to) view all of the data at once. You can instead use `head()` to view the first few rows. The integer argument `n` specifies the number of rows you would like to return.
```{r penguins-head}
#| echo: true
#| eval: true
head(x = penguins, n = 3)
```
### `str()`
An alternative way to view a data set that is large or has a complicated format is to examine its structure using `str()`. This is a useful way to inspect list-like objects with a nested structure.
```{r penguins-structure}
#| echo: true
#| eval: true
str(penguins)
```
### `names()`
If you just want to access the variable names you can do so with the `names()` function from base R.
```{r penguins-names}
#| echo: true
#| eval: true
names(penguins)
```
Similarly, you can explicitly access the row and column names of a data frame or tibble using `colnames()` or `rownames()`.
```{r cars-colnames}
#| echo: true
#| eval: true
colnames(cars)
```
```{r cars-rownames}
#| echo: true
#| eval: true
rownames(cars)
```
In the `cars` data, the car models are stored as the row names. This doesn't really jibe with our idea of tidy data - we'll see how to fix that shortly.
## Renaming Variables
### `colnames()`
The function `colnames()` can be used to set, as well as to retrieve, column names.
```{r cars-renaming-with-colnames}
#| echo: true
#| eval: true
cars_renamed <- cars
colnames(cars_renamed)[1] <- "miles_per_gallon"
colnames(cars_renamed)
```
### `dplyr::rename()`
We can also use functions from `{dplyr}` to rename columns. Let's alter the second column name.
```{r cars-renaming-with-dplyr}
#| echo: true
#| eval: true
#| warning: false
library(dplyr)
cars_renamed <- rename(.data = cars_renamed, cylinders = cyl)
colnames(cars_renamed)
```
This could be done as part of a pipe if we were making many alterations.
```{r cars-renaming-with-pipe}
#| echo: true
#| eval: true
cars_renamed <- cars_renamed %>%
  rename(displacement = disp) %>%
  rename(horse_power = hp) %>%
  rename(rear_axle_ratio = drat)
colnames(cars_renamed)
```
When using the `{dplyr}` function `rename()`, you have to remember the format `new_name = old_name`. This matches the format used to create a data frame or tibble, but it is the opposite order to the Python (pandas) method of the same name and often catches people out.
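To make that argument order concrete, here is a quick sketch renaming two further columns in a single `rename()` call (the new names are invented for illustration, and the result isn't saved back to `cars_renamed`):
```{r cars-rename-multiple}
#| echo: true
#| eval: true
cars_renamed %>%
  rename(quarter_mile_time = qsec, engine_shape = vs) %>%
  colnames()
```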
In the section on [creating new variables](#creating-new-variables), we will see an alternative way of doing this by copying the column and then deleting the original.
## Subsetting
### Base R
In base R you can extract rows, columns and combinations thereof using index notation.
```{r base-r-subsetting}
#| echo: true
#| eval: true
# First row
penguins[1, ]
# First Column
penguins[ , 1]
# Rows 2-3 of columns 1, 2 and 4
penguins[2:3, c(1, 2, 4)]
```
Using negative indexing, you can remove rows or columns.
```{r base-r-drop-rows}
#| echo: true
#| eval: true
# Drop all but first row
penguins[-(2:344), ]
```
```{r base-r-drop-columns}
#| echo: true
#| eval: true
# Drop all but first column
penguins[ , -(2:8)]
```
You can also select rows or columns by their names. This can be done using the bracket syntax (`[ ]`) or the dollar syntax (`$`).
```{r base-r-extract-columns-by-name}
#| echo: true
#| eval: true
penguins[ , "species"]
penguins$species
```
Since `penguins` is a tibble, these return different types of object. Subsetting a tibble with bracket syntax will return a tibble but extracting a column using the dollar syntax returns a vector of values.
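We can confirm this difference by checking the class of each result:
```{r tibble-subsetting-classes}
#| echo: true
#| eval: true
# bracket subsetting keeps the tibble structure
class(penguins[ , "species"])
# dollar extraction returns the underlying (factor) vector
class(penguins$species)
```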
### `filter()` and `select()`
`{dplyr}` has two functions for subsetting: `filter()` subsets by rows and `select()` subsets by columns.
In both functions, you list what you would like to retain. `filter()` and `select()` calls can be piped together to subset based on both row and column values.
```{r penguins-select}
#| echo: true
#| eval: true
penguins %>%
  select(species, island, body_mass_g)
```
```{r penguins-filter}
#| echo: true
#| eval: true
penguins %>%
  select(species, island, body_mass_g) %>%
  filter(body_mass_g > 6000)
```
Subsetting rows can be inverted by negating the `filter()` statement
```{r penguins-filter-with-negation}
#| echo: true
#| eval: true
penguins %>%
  select(species, island, body_mass_g) %>%
  filter(!(body_mass_g > 6000))
```
and dropping columns can be done by selecting all columns except the one(s) you want to drop.
```{r penguins-select-with-negation}
#| echo: true
#| eval: true
penguins %>%
  select(species, island, body_mass_g) %>%
  filter(!(body_mass_g > 6000)) %>%
  select(!c(species, island))
```
## Creating New Variables {#creating-new-variables}
### Base R
We can create new variables in base R by assigning a vector of the correct length to a new column name.
```{r base-r-rename-columns}
#| echo: true
#| eval: true
cars_renamed$weight <- cars_renamed$wt
```
If we then drop the original column from the data frame, this gives us an alternative way of renaming columns.
```{r base-r-renaming-columns-alternative}
#| echo: true
#| eval: true
cars_renamed <- cars_renamed[ ,-which(names(cars_renamed) == "wt")]
head(cars_renamed, n = 5)
```
One thing to be aware of is that this approach moves the renamed column to the end of the data frame - the original column ordering is not preserved.
Generally speaking, code that relies on columns being in a specific order is fragile - it breaks easily. If possible, you should try to write your code in a way that's robust to column reordering. I've done that here when removing the `wt` column by looking up its index as part of my code, rather than assuming it will always be in the same position.
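As a sketch of one such robust alternative, `{dplyr}`'s `select()` can drop a column by name, so the code doesn't depend on column positions at all (shown here on the original `cars` data, which still contains `wt`):
```{r dplyr-drop-column-by-name}
#| echo: true
#| eval: true
cars %>%
  select(-wt) %>%
  head(n = 3)
```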
### `dplyr::mutate()`
The function from `{dplyr}` used to create new columns is `mutate()`. The weights in `mtcars` are recorded in units of 1000 lbs, so let's create another column with each car's weight in pounds.
```{r dplyr-mutate}
#| echo: true
#| eval: true
cars_renamed <- cars_renamed %>%
  mutate(weight_lbs = weight * 1000)
cars_renamed %>%
  select(miles_per_gallon, cylinders, displacement, weight, weight_lbs) %>%
  head(n = 5)
```
You can also create new columns that are functions of multiple other columns.
```{r dplyr-mutate-from-multiple-columns}
#| echo: true
#| eval: true
cars_renamed <- cars_renamed %>%
  mutate(cylinder_adjusted_mpg = miles_per_gallon / cylinders)
```
### `rownames_to_column()`
One useful example of adding a new column to a data frame is converting its row names into a column of the data frame.
```{r dplyr-rownames-to-column}
#| echo: true
#| eval: true
cars %>%
  mutate(model = rownames(cars)) %>%
  select(mpg, cyl, model) %>%
  head(n = 5)
```
There's a neat function called `rownames_to_column()` in the `{tibble}` package. This will add the row names as the first column and remove the row names all in one step.
```{r tibble-rownames-to-column}
#| echo: true
#| eval: true
cars %>%
  tibble::rownames_to_column(var = "model") %>%
  head(n = 5)
```
### `rowid_to_column()`
Another function from `{tibble}` adds the row id of each observation as a new column. This is often useful when ordering or combining tables.
```{r tibble-row-ids-to-column}
#| echo: true
#| eval: true
cars %>%
  tibble::rowid_to_column(var = "row_id") %>%
  head(n = 5)
```
## Summaries
The `summarise()` function allows you to collapse a data frame into a single row, using summary statistics of your choosing.
We can calculate the average bill length of all penguins in a single `summarise()` function call.
```{r penguins-summarise}
#| echo: true
#| eval: true
summarise(penguins, average_bill_length_mm = mean(bill_length_mm))
```
Since we have missing values, we might instead want to calculate the mean of the recorded values.
```{r penguins-summarise-with-arguments}
#| echo: true
#| eval: true
summarise(penguins, average_bill_length_mm = mean(bill_length_mm, na.rm = TRUE))
```
We can also use `summarise()` to gather multiple summaries in a single data frame.
```{r penguins-summarise-multiple-statistics}
#| echo: true
#| eval: true
bill_length_mm_summary <- penguins %>%
  summarise(
    mean = mean(bill_length_mm, na.rm = TRUE),
    median = median(bill_length_mm, na.rm = TRUE),
    min = min(bill_length_mm, na.rm = TRUE),
    q_0 = min(bill_length_mm, na.rm = TRUE),
    q_1 = quantile(bill_length_mm, probs = 0.25, na.rm = TRUE),
    q_2 = median(bill_length_mm, na.rm = TRUE),
    q_3 = quantile(bill_length_mm, probs = 0.75, na.rm = TRUE),
    q_4 = max(bill_length_mm, na.rm = TRUE))
bill_length_mm_summary
```
In all, this isn't overly exciting. You might rightly wonder why you'd want to use `summarise()` when you could just use the simpler base R functions directly. One benefit of `summarise()` is that it provides certainty that the returned object will be a tibble, no matter what summary function is used.
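We can check that guarantee directly:
```{r summarise-class-check}
#| echo: true
#| eval: true
class(summarise(penguins, mean_mass = mean(body_mass_g, na.rm = TRUE)))
```
However, `summarise()` really comes into its own when you want to apply these summaries to distinct subgroups of the data.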
## Grouped Operations
The real benefit of `summarise()` comes from its combination with `group_by()`. This allows you to calculate the same summary statistics for each level of a factor with only one additional line of code.
Here we're re-calculating the same set of summary statistics we just found for all penguins, but for each individual species.
```{r dplyr-grouped-summarise}
#| echo: true
#| eval: true
penguins %>%
  group_by(species) %>%
  summarise(
    mean = mean(bill_length_mm, na.rm = TRUE),
    median = median(bill_length_mm, na.rm = TRUE),
    min = min(bill_length_mm, na.rm = TRUE),
    q_0 = min(bill_length_mm, na.rm = TRUE),
    q_1 = quantile(bill_length_mm, probs = 0.25, na.rm = TRUE),
    q_2 = median(bill_length_mm, na.rm = TRUE),
    q_3 = quantile(bill_length_mm, probs = 0.75, na.rm = TRUE),
    q_4 = max(bill_length_mm, na.rm = TRUE))
```
You can group by multiple factors to calculate summaries for each distinct combination of levels within your data set. Here we group by combinations of species and the island to which they belong.
```{r dplyr-repeated-grouping}
#| echo: true
#| eval: true
penguin_summary_stats <- penguins %>%
  group_by(species, island) %>%
  summarise(
    mean = mean(bill_length_mm, na.rm = TRUE),
    median = median(bill_length_mm, na.rm = TRUE),
    min = min(bill_length_mm, na.rm = TRUE),
    q_0 = min(bill_length_mm, na.rm = TRUE),
    q_1 = quantile(bill_length_mm, probs = 0.25, na.rm = TRUE),
    q_2 = median(bill_length_mm, na.rm = TRUE),
    q_3 = quantile(bill_length_mm, probs = 0.75, na.rm = TRUE),
    q_4 = max(bill_length_mm, na.rm = TRUE))
penguin_summary_stats
```
### Ungrouping
By default, each call to `summarise()` will undo one level of grouping. This means that our previous result is still grouped by species. We can see this in the tibble output above or by examining the class of the returned object, which tells us that this is an S3 object of class `grouped_df` that inherits from the `tbl_df`, `tbl` and `data.frame` classes.
```{r penguins-still-grouped}
#| echo: true
#| eval: true
class(penguin_summary_stats)
```
Since we have grouped by two variables, `{dplyr}` expects us to use two `summarise()` calls before it will return an ungrouped data frame (or tibble). One way to satisfy this is to apply a second summary at the species level of grouping.
```{r penguins-still-grouped-evidence}
#| echo: true
#| eval: true
#| warning: false
penguin_summary_stats %>%
  summarise_all(mean, na.rm = TRUE)
```
However, we won't always want to apply another summary. In that case, we can undo the grouping using `ungroup()`. Forgetting to ungroup is a common mistake and a cause of confusion when working with multiple-group summaries.
```{r ungrouping-penguins}
#| echo: true
#| eval: true
ungroup(penguin_summary_stats)
```
There's an alternative way to achieve the same thing in a single step when using `{dplyr}` version 1.1.0 or above: set the `.by` argument of the `summarise()` call, which determines the grouping applied to the original data frame, just for this one operation.
```{r penguins-summarise-by-island}
penguins %>%
  summarise(mean_bill_length_mm = mean(bill_length_mm, na.rm = TRUE), .by = island)
```
The `.by` argument applies to a single operation. This means that the result of the `summarise()` call will always be an ungrouped tibble, regardless of the number of grouping columns.
```{r penguins-summarise-by-island-species}
penguins %>%
  summarise(
    mean_bill_length_mm = mean(bill_length_mm, na.rm = TRUE),
    .by = c(island, species))
```
See `?dplyr_by` for more information on using the `.by` argument with `summarise()` and other `{dplyr}` verbs.
## Reordering Factors
R stores factors as integer values, which it then maps to a set of labels. Only factor levels that appear in your data will be assigned an integer code and, by default, the mapping between levels and integers follows the alphabetical order of the labels rather than the order in which they appear in your data.
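A minimal sketch of this integer coding, using a throwaway factor:
```{r factor-integer-codes}
#| echo: true
#| eval: true
sizes <- factor(c("M", "S", "M", "L"))
levels(sizes)      # levels are sorted alphabetically by default
as.integer(sizes)  # the integer codes that R stores internally
```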
This can be annoying, particularly when your factor levels relate to properties that aren't numerical but do have an inherent ordering to them. In the example below, we have the t-shirt sizes of twelve people.
```{r making-tshirt-tibble}
#| echo: true
#| eval: true
tshirts <- tibble::tibble(
  id = 1:12,
  size = as.factor(c("L", NA, "M", "S", "XS", "M", "XXL", "L", "XS", "M", "L", "S"))
)
levels(tshirts$size)
```
Irritatingly, the sizes aren't in order and extra large (XL) isn't among the levels, because it doesn't appear in this particular sample. This leads to awkward looking summary tables and plots.
```{r tshirts-summarise}
#| echo: true
#| eval: true
tshirts %>% group_by(size) %>% summarise(count = n())
```
We can fix this by creating a new variable with the factor levels explicitly coded in the correct order. We also need to specify that empty groups should not be dropped as part of `group_by()`.
```{r tshirts-grouped-summaries}
#| echo: true
#| eval: true
tidy_tshirt_levels <- c("XS", "S", "M", "L", "XL", "XXL", NA)
tshirts %>%
  mutate(size_tidy = factor(size, levels = tidy_tshirt_levels)) %>%
  group_by(size_tidy, .drop = FALSE) %>%
  summarise(count = n())
```
## Be Aware: Factors
As we have already seen, categorical variables can cause issues when wrangling and presenting data in R. All of these problems are solvable using base R techniques, but the `{forcats}` package provides tools for the most common of them. This includes functions for changing the order of factor levels or the values with which they are associated.
Some example functions from the package include:
- `fct_reorder()`: Reordering a factor by another variable.
- `fct_infreq()`: Reordering a factor by the frequency of values.
- `fct_relevel()`: Changing the order of a factor by hand.
- `fct_lump()`: Collapsing the least/most frequent values of a factor into "other".
Examples of each of these can be found in the [forcats vignette](https://forcats.tidyverse.org/articles/forcats.html) or the [factors chapter](https://r4ds.had.co.nz/factors.html) of R for Data Science, and a couple are sketched below.
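Here's a minimal sketch of two of these functions applied to the `penguins` data (assuming the `{forcats}` package is installed):
```{r forcats-examples}
#| echo: true
#| eval: true
library(forcats)
# reorder island levels from most to least frequent
penguins %>%
  mutate(island = fct_infreq(island)) %>%
  pull(island) %>%
  levels()
# collapse all but the most frequent island into "Other"
penguins %>%
  mutate(island = fct_lump(island, n = 1)) %>%
  count(island)
```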
## Be Aware: Strings
Working with and analysing text data is a skill unto itself. However, it is useful to be able to do some basic manipulation of character strings programmatically.
Because R was developed as a statistical programming language, it is well suited to the computational and modelling aspects of working with text data, but the base R string manipulation functions can be a bit unwieldy at times.
The `{stringr}` package aims to combat this by providing useful helper functions for a range of text management problems. Even when you aren't analysing text data, these can be useful - for example, when removing prefixes from many column names.
```{r make-poorly-named-df}
#| echo: false
#| eval: true
#| warning: false
dat <- matrix(rnorm(100), nrow = 10, ncol = 10)
poorly_named_df <- tibble::as_tibble(as.data.frame(dat))
colnames(poorly_named_df) <- stringr::str_c("V", 1:10, "_", LETTERS[1:10])
poorly_named_df <- poorly_named_df %>%
  tibble::rowid_to_column(var = "observation_id")
```
Suppose we wanted to keep only the text following an underscore in these column names. We could do that by using a regular expression to extract lower-case or upper-case letters which follow an underscore.
```{r show-poorly-named-df}
#| echo: true
#| eval: true
head(poorly_named_df)
```
```{r extract-column-names-with-regex}
#| echo: true
#| eval: true
stringr::str_extract(names(poorly_named_df), pattern = "(?<=_)([a-zA-Z]+)")
```
Alternatively, we can avoid using [regular expressions](https://en.wikipedia.org/wiki/Regular_expression) altogether: we can use `stringr::str_split()` to divide each column name at the underscore and keep only the second part of each string.
```{r clean-column-names-manually}
#| echo: true
#| eval: true
# split column names at underscores and inspect the structure of the resulting object
split_strings <- stringr::str_split(names(poorly_named_df), pattern = "_")
str(split_strings)
# keep only the second element of each character vector in the list
purrr::map_chr(split_strings, function(x) x[2])
```
Unless you plan to work extensively with text data, I would recommend that you look up such string manipulations as you need them. The [strings](https://r4ds.had.co.nz/strings.html#strings) section of R for Data Science is a useful starting point.
## Be Aware: Date-Times
Remember [all the fuss](#naming-things-date-order) we made about storing dates in the ISO standard format? That was because dates and times are complicated enough to work on without adding extra ambiguity.
$$\text{YYYY-MM-DD}$$
Dates, times and time intervals have to reconcile two factors: the physical orbit of the Earth around the Sun, and the social and geopolitical mechanisms that determine how we measure and record the passing of time. This makes the history of date and time records fascinating, but it also makes working with this type of data complicated.
Moving from larger to smaller time spans: leap years alter the number of days in a year, and months are of variable length (with February's length changing from year to year). If your data are measured in a place that uses daylight saving, then one day a year will be 23 hours long and another will be 25 hours long. To make things worse, the dates and the hour at which the clocks change are not uniform across countries; a country might span multiple time zones, and those time-zone boundaries can shift over time.
Even at the level of minutes and seconds we aren't safe: the Earth's rotation is gradually slowing, so a leap second is added approximately every 21 months. Nor are things any better when looking at longer time scales or across cultures, where we might have to account for different calendars: months have been added, removed and altered over time, and other calendar systems take different approaches to measuring time, using different units and origin points.
With all of these issues, you have to be very careful when working with date and time data. Functions to help you with this can be found in the `{lubridate}` package, with examples in the [dates and times](https://r4ds.had.co.nz/dates-and-times.html#dates-and-times) chapter of R for Data Science.
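As a minimal sketch of the sort of trap `{lubridate}` helps guard against, consider daylight saving. The date below is chosen because UK clocks went forward on 2024-03-31, making that day 23 hours long; adding a one-day _period_ preserves clock time, while adding a 24-hour _duration_ does not.
```{r lubridate-dst-sketch}
#| echo: true
#| eval: true
library(lubridate)
# midday, the day before UK clocks go forward
start <- ymd_hms("2024-03-30 12:00:00", tz = "Europe/London")
# a period of one day keeps the clock time at 12:00
start + days(1)
# a duration of 24 hours lands at 13:00, because the day was 23 hours long
start + ddays(1)
```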
## Be Aware: Relational Data
When the data you need are stored across two or more data frames, you need to be able to cross-reference those tables and match up values for each observational unit. This sort of data is known as relational data and is used extensively in data science.
The variables you use to match observational units across data frames are known as _keys_. A _primary key_ uniquely identifies an observational unit in its own table, while a _foreign key_ identifies an observational unit in another table. There are various ways to join two data frames, depending on which observational units you want to retain.
### Join types
You might want to keep only the observational units whose key values appear in both data frames; this is known as an __inner join__.
```{r inner-join-image}
#| echo: false
#| eval: true
#| out-width: 90%
#| fig-align: center
#| fig-cap: "Inner join diagram. Source: R for Data Science"
#| fig-alt: "Schematic diagram showing that inner joins retain only the rows that occur in both tables."
knitr::include_graphics("images/301-edav-wrangling/join-inner.png")
```
You might instead want to keep all units from the primary table, padding with NAs where there is no corresponding foreign key in the second table. This is known as an __(outer) left join__.
```{r types-of-join-image}
#| echo: false
#| eval: true
#| out-width: 90%
#| fig-align: center
#| fig-cap: "Diagram for left, right and outer joins. Source: R for Data Science"
#| fig-alt: "Schematic diagram showing left, right and full joins. Left joins retain all rows from the first table, filling NAs where there is no matching key value in the second table. Right joins do the converse, retaining all rows from the second table, padding with NAs as needed. Full joins keep all values from both tables, padding with NAs as needed."
knitr::include_graphics("images/301-edav-wrangling/join-left-right-full.png")
```
Conversely, you might keep all units from the second table, padding with NAs where there is no corresponding key in the primary table. This is imaginatively named an __(outer) right join__.
In the __(outer) full join__, all observational units from either table are retained and any missing values are padded with NAs.
Things get more complicated when keys don't uniquely identify observational units in one or both of the tables. I'd recommend you start exploring these ideas with the [relational data](https://r4ds.had.co.nz/relational-data.html) chapter of R for Data Science.
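To make the basic joins concrete, here's a minimal sketch using two toy tibbles (the names and values are invented for illustration):
```{r join-examples-sketch}
#| echo: true
#| eval: true
band <- tibble::tibble(
  name = c("Mick", "John", "Paul"),
  band = c("Stones", "Beatles", "Beatles"))
instruments <- tibble::tibble(
  name = c("John", "Paul", "Keith"),
  plays = c("guitar", "bass", "guitar"))
# inner join: only rows whose key appears in both tables
inner_join(band, instruments, by = "name")
# left join: all rows from the first table, NA-padded where needed
left_join(band, instruments, by = "name")
# full join: all rows from both tables, NA-padded where needed
full_join(band, instruments, by = "name")
```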
### Why and where to learn more
Working with relational data is essential to getting any data science up and running out in the wilds of reality, because businesses and companies don't store all of their data in one huge CSV file. For one thing, that wouldn't be very efficient: most cells would be empty. It also wouldn't be very secure, since you couldn't grant partial access to the data. That's why information is usually stored across many data frames (more generically known as tables) within one or more databases.
These data silos are created, maintained, accessed and destroyed using a relational database management system. These management systems use code to manage and access the stored data, just like the `{dplyr}` commands we have seen above. You might well have heard of the SQL programming language (and its many variants), which is a popular language for database management and the inspiration for the `{dplyr}` package and its verbs.
If you'd like to learn more, there are many excellent introductory SQL books and courses. I'd recommend picking one that focuses on data analysis or data science, unless you really want to dig into the efficient storage and querying of databases.
## Wrapping up
We have:
- Learned how to wrangle tabular data in R with `{dplyr}`.
- Met the idea of relational data and `{dplyr}`'s relationship to SQL.
- Become aware of some tricky data types and packages that can help.
## Session Information
```{r session-information}
pander::pander(sessionInfo())
```