-
Notifications
You must be signed in to change notification settings - Fork 230
/
relational-data.Rmd
917 lines (685 loc) · 30.3 KB
/
relational-data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
# Relational data {#relational-data .r4ds-section}
## Introduction {#introduction-7 .r4ds-section}
The datamodelr package is used to draw database schema.
```{r package,message=FALSE,cache=FALSE}
library("tidyverse")
library("nycflights13")
library("viridis")
library("datamodelr")
```
## nycflights13 {#nycflights13-relational .r4ds-section}
### Exercise 13.2.1 {.unnumbered .exercise data-number="13.2.1"}
<div class="question">
Imagine you wanted to draw (approximately) the route each plane flies from its origin to its destination.
What variables would you need?
What tables would you need to combine?
</div>
<div class="answer">
Drawing the routes requires the latitude and longitude of the origin and the destination airports of each flight.
This requires the `flights` and `airports` tables.
The `flights` table has the origin (`origin`) and destination (`dest`) airport of each flight.
The `airports` table has the longitude (`lon`) and latitude (`lat`) of each airport.
To get the latitude and longitude for the origin and destination of each flight,
requires two joins for `flights` to `airports`,
once for the latitude and longitude of the origin airport,
and once for the latitude and longitude of the destination airport.
I use an inner join in order to drop any flights with missing airports since they will not have a longitude or latitude.
```{r}
flights_latlon <- flights %>%
inner_join(select(airports, origin = faa, origin_lat = lat, origin_lon = lon),
by = "origin"
) %>%
inner_join(select(airports, dest = faa, dest_lat = lat, dest_lon = lon),
by = "dest"
)
```
This plots the approximate flight paths of the first 100 flights in the `flights` dataset.
```{r}
flights_latlon %>%
slice(1:100) %>%
ggplot(aes(
x = origin_lon, xend = dest_lon,
y = origin_lat, yend = dest_lat
)) +
borders("state") +
geom_segment(arrow = arrow(length = unit(0.1, "cm"))) +
coord_quickmap() +
labs(y = "Latitude", x = "Longitude")
```
</div>
### Exercise 13.2.2 {.unnumbered .exercise data-number="13.2.2"}
<div class="question">
I forgot to draw the relationship between `weather` and `airports`.
What is the relationship and how should it appear in the diagram?
</div>
<div class="answer">
The column `airports$faa` is a foreign key of `weather$origin`.
The following drawing updates the one in [Section 13.2](https://r4ds.had.co.nz/relational-data.html#nycflights13-relational) to include this relation.
The line representing the new relation between `weather` and `airports` is colored black.
The lines representing the old relations are gray and thinner.
```{r, echo = FALSE, out.width = NULL, purl = FALSE}
knitr::include_graphics("diagrams/nycflights.png")
```
</div>
### Exercise 13.2.3 {.unnumbered .exercise data-number="13.2.3"}
<div class="question">
Weather only contains information for the origin (NYC) airports.
If it contained weather records for all airports in the USA, what additional relation would it define with `flights`?
</div>
<div class="answer">
If the weather was included for all airports in the US, then it would provide the weather for the destination of each flight.
The `weather` data frame columns (`year`, `month`, `day`, `hour`, `origin`) are a foreign key for the `flights` data frame columns (`year`, `month`, `day`, `hour`, `dest`).
This would provide information about the weather at the destination airport at the time of the flight take off, unless the arrival date-time were calculated.
So why was this not a relationship prior to adding additional rows to the `weather` table?
In a foreign key relationship, the collection of columns in the child table
must refer to a unique collection of columns in the parent table.
When the `weather` table only contained New York airports,
there were many values of (`year`, `month`, `day`, `hour`, `dest`) in `flights` that
did not appear in the `weather` table.
Therefore, it was not a foreign key. It was only after
all combinations of year, month, day, hour, and airports that are defined in `flights`
were added to the `weather` table that there existed this relation between these tables.
</div>
### Exercise 13.2.4 {.unnumbered .exercise data-number="13.2.4"}
<div class="question">
We know that some days of the year are “special”, and fewer people than usual fly on them.
How might you represent that data as a data frame?
What would be the primary keys of that table?
How would it connect to the existing tables?
</div>
<div class="answer">
I would add a table of special dates, similar to the following table.
```{r}
special_days <- tribble(
~year, ~month, ~day, ~holiday,
2013, 01, 01, "New Years Day",
2013, 07, 04, "Independence Day",
2013, 11, 29, "Thanksgiving Day",
2013, 12, 25, "Christmas Day"
)
```
The primary key of the table would be the (`year`, `month`, `day`) columns.
The (`year`, `month`, `day`) columns could be used to join `special_days` with other tables.
</div>
## Keys {#keys .r4ds-section}
### Exercise 13.3.1 {.unnumbered .exercise data-number="13.3.1"}
<div class="question">
Add a surrogate key to flights.
</div>
<div class="answer">
I add the column `flight_id` as a surrogate key.
I sort the data prior to making the key, even though it is not strictly necessary, so the order of the rows has some meaning.
```{r}
flights %>%
arrange(year, month, day, sched_dep_time, carrier, flight) %>%
mutate(flight_id = row_number()) %>%
glimpse()
```
</div>
### Exercise 13.3.2 {.unnumbered .exercise data-number="13.3.2"}
<div class="question">
Identify the keys in the following datasets
1. `Lahman::Batting`
1. `babynames::babynames`
1. `nasaweather::atmos`
1. `fueleconomy::vehicles`
1. `ggplot2::diamonds`
(You might need to install some packages and read some documentation.)
</div>
<div class="answer">
The answer to each part follows.
1. The primary key for `Lahman::Batting` is (`playerID`, `yearID`, `stint`).
The columns (`playerID`, `yearID`) are not a primary key because players can play on different teams within the same year.
```{r}
Lahman::Batting %>%
count(playerID, yearID, stint) %>%
filter(n > 1) %>%
nrow()
```
1. The primary key for `babynames::babynames` is (`year`, `sex`, `name`).
The columns (`year`, `name`) are not a primary key since there are separate counts for each name for each sex, and the same names can be used by more than one sex.
```{r}
babynames::babynames %>%
count(year, sex, name) %>%
filter(n > 1) %>%
nrow()
```
1. The primary key for `nasaweather::atmos` is (`lat`, `long`, `year`, `month`).
The primary key represents the location and time that the measurement was taken.
```{r}
nasaweather::atmos %>%
count(lat, long, year, month) %>%
filter(n > 1) %>%
nrow()
```
1. The column `id`, the unique EPA identifier of the vehicle, is the primary key for `fueleconomy::vehicles`.
```{r}
fueleconomy::vehicles %>%
count(id) %>%
filter(n > 1) %>%
nrow()
```
1. There is no primary key for `ggplot2::diamonds` since there is no combination of variables that uniquely identifies each observation.
This is implied by the fact that the number of distinct rows in the dataset is less than the total number of rows, meaning that there are some duplicate rows.
```{r}
ggplot2::diamonds %>%
distinct() %>%
nrow()
nrow(ggplot2::diamonds)
```
If we need a unique identifier for our analysis, we could add a surrogate key.
```{r}
diamonds <- mutate(ggplot2::diamonds, id = row_number())
```
</div>
### Exercise 13.3.3 {.unnumbered .exercise data-number="13.3.3"}
<div class="question">
Draw a diagram illustrating the connections between the `Batting`, `Master`, and `Salaries` tables in the Lahman package.
Draw another diagram that shows the relationship between `Master`, `Managers`, `AwardsManagers`.
How would you characterize the relationship between the `Batting`, `Pitching`, and `Fielding` tables?
</div>
<div class="answer">
For the `Batting`, `Master`, and `Salaries` tables:
- `Master`
- Primary key: `playerID`
- `Batting`
- Primary key: `playerID`, `yearID`, `stint`
- Foreign keys:
- `playerID` = `Master$playerID` (many-to-1)
- `Salaries`
- Primary key: `yearID`, `teamID`, `playerID`
- Foreign keys:
- `playerID` = `Master$playerID` (many-to-1)
The columns `teamID` and `lgID` are not foreign keys even though they appear in multiple tables (with the same meaning) because they are not primary keys for any of the tables considered in this exercise.
The `teamID` variable references `Teams$teamID`, and `lgID` does not have its own table.
*R for Data Science* uses database schema diagrams to illustrate relations between the tables.
Most flowchart or diagramming software can be used used to create database schema diagrams, as well as some specialized database software.
The diagrams in *R for Data Science* were created with [OmniGraffle](https://www.gliffy.com/), and their sources can be found in its [GitHub repository](https://github.com/hadley/r4ds/tree/master/diagrams).
The following diagram was created with OmniGraffle in the same style as those
in *R for Data Science* .
It shows the relations between the `Master`, `Batting` and `Salaries` tables.
```{r, echo = FALSE, out.width = NULL, purl = FALSE}
knitr::include_graphics("diagrams/Lahman1.png")
```
Another option to draw database schema diagrams is the R package [datamodelr](https://github.com/bergant/datamodelr), which can programmatically create database schema diagrams.
The following code uses datamodelr to draw a diagram of the relations between the `Batting`, `Master`, and `Salaries` tables.
```{r cache=FALSE}
dm1 <- dm_from_data_frames(list(
Batting = Lahman::Batting,
Master = Lahman::Master,
Salaries = Lahman::Salaries
)) %>%
dm_set_key("Batting", c("playerID", "yearID", "stint")) %>%
dm_set_key("Master", "playerID") %>%
dm_set_key("Salaries", c("yearID", "teamID", "playerID")) %>%
dm_add_references(
Batting$playerID == Master$playerID,
Salaries$playerID == Master$playerID
)
dm_create_graph(dm1, rankdir = "LR", columnArrows = TRUE) %>%
dm_render_graph()
```
For the `Master`, `Manager`, and `AwardsManagers` tables:
- `Master`
- Primary key: `playerID`
- `Managers`
- Primary key: `yearID`, `teamID`, `inseason`
- Foreign keys:
- `playerID` references `Master$playerID` (many-to-1)
- `AwardsManagers`:
- Primary key: `playerID`, `awardID`, `yearID`
- Foreign keys:
- `playerID` references `Master$playerID` (many-to-1)
For `AwardsManagers`, the columns (`awardID`, `yearID`, `lgID`) are not a primary
key because there can be, and have been ties, as indicated by the `tie` variable.
The relations between the `Master`, `Managers`, and `AwardsManagers` tables
are shown in the following two diagrams: the first created manually with OmniGraffle,
and the second programmatically in R with the datamodelr package.
```{r, echo = FALSE, out.width = NULL, purl = FALSE}
knitr::include_graphics("diagrams/Lahman2.png")
```
```{r cache=FALSE}
dm2 <- dm_from_data_frames(list(
Master = Lahman::Master,
Managers = Lahman::Managers,
AwardsManagers = Lahman::AwardsManagers
)) %>%
dm_set_key("Master", "playerID") %>%
dm_set_key("Managers", c("yearID", "teamID", "inseason")) %>%
dm_set_key("AwardsManagers", c("playerID", "awardID", "yearID")) %>%
dm_add_references(
Managers$playerID == Master$playerID,
AwardsManagers$playerID == Master$playerID
)
dm_create_graph(dm2, rankdir = "LR", columnArrows = TRUE) %>%
dm_render_graph()
```
The primary keys of `Batting`, `Pitching`, and `Fielding` are the following:
- `Batting`: (`playerID`, `yearID`, `stint`)
- `Pitching`: (`playerID`, `yearID`, `stint`)
- `Fielding`: (`playerID`, `yearID`, `stint`, `POS`).
While `Batting` and `Pitching` has one row per player, year, stint, the `Fielding`
table has additional rows for each position (`POS`) a player played within a stint.
Since `Batting`, `Pitching`, and `Fielding` all share the `playerID`, `yearID`, and `stint`
we would expect some foreign key relations between these tables.
The columns (`playerID`, `yearID`, `stint`) in `Pitching` are a foreign key which
references the same columns in `Batting`. We can check this by checking that
all observed combinations of values of these columns appearing in `Pitching`
also appear in `Batting`. To do this I use an anti-join, which is discussed
in the section [Filtering Joins](https://r4ds.had.co.nz/relational-data.html#filtering-joins).
```{r}
nrow(anti_join(Lahman::Pitching, Lahman::Batting,
by = c("playerID", "yearID", "stint")
))
```
Similarly, the columns (`playerID`, `yearID`, `stint`) in `Fielding` are a foreign key which references the same columns in `Batting`.
```{r}
nrow(anti_join(Lahman::Fielding, Lahman::Batting,
by = c("playerID", "yearID", "stint")
))
```
The following diagram shows the relations between the `Batting`, `Pitching`, and
`Fielding` tables.
```{r, echo = FALSE, out.width = NULL, purl = FALSE}
knitr::include_graphics("diagrams/Lahman3.png")
```
</div>
## Mutating joins {#mutating-joins .r4ds-section}
```{r}
flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
```
### Exercise 13.4.1 {.unnumbered .exercise data-number="13.4.1"}
<div class="question">
Compute the average delay by destination, then join on the `airports` data frame so you can show the spatial distribution of delays. Here’s an easy way to draw a map of the United States:
```{r}
airports %>%
semi_join(flights, c("faa" = "dest")) %>%
ggplot(aes(lon, lat)) +
borders("state") +
geom_point() +
coord_quickmap()
```
(Don’t worry if you don’t understand what `semi_join()` does — you’ll learn about it next.)
You might want to use the size or color of the points to display the average delay for each airport.
</div>
<div class="answer">
```{r}
avg_dest_delays <-
flights %>%
group_by(dest) %>%
# arrival delay NA's are cancelled flights
summarise(delay = mean(arr_delay, na.rm = TRUE)) %>%
inner_join(airports, by = c(dest = "faa"))
```
```{r}
avg_dest_delays %>%
ggplot(aes(lon, lat, colour = delay)) +
borders("state") +
geom_point() +
coord_quickmap()
```
</div>
### Exercise 13.4.2 {.unnumbered .exercise data-number="13.4.2"}
<div class="question">
Add the location of the origin and destination (i.e. the `lat` and `lon`) to `flights`.
</div>
<div class="answer">
You can perform one join after another. If duplicate variables are found, by default, dplyr will distinguish the two by adding `.x`, and `.y` to the ends of the variable names to solve naming conflicts.
```{r}
airport_locations <- airports %>%
select(faa, lat, lon)
flights %>%
select(year:day, hour, origin, dest) %>%
left_join(
airport_locations,
by = c("origin" = "faa")
) %>%
left_join(
airport_locations,
by = c("dest" = "faa")
)
```
The `suffix` argument overrides this default behavior.
Since is always good practice to have clear variable names, I will use the
suffixes `"_dest"` and `"_origin`" to specify whether the column refers to
the destination or origin airport.
```{r}
airport_locations <- airports %>%
select(faa, lat, lon)
flights %>%
select(year:day, hour, origin, dest) %>%
left_join(
airport_locations,
by = c("origin" = "faa")
) %>%
left_join(
airport_locations,
by = c("dest" = "faa"),
suffix = c("_origin", "_dest")
# existing lat and lon variables in tibble gain the _origin suffix
# new lat and lon variables are given _dest suffix
)
```
</div>
### Exercise 13.4.3 {.unnumbered .exercise data-number="13.4.3"}
<div class="question">
Is there a relationship between the age of a plane and its delays?
</div>
<div class="answer">
The question does not specify whether the relationship is with departure delay
or arrival delay.
I will look at both.
To compare the age of the plane to flights delay, I merge `flights` with the `planes`, which contains a variable `plane_year`, with the year in which the plane was built.
To look at the relationship between plane age and departure delay, I will calculate the average arrival and departure delay for each age of a flight.
Since there are few planes older than 25 years, so I truncate `age` at 25 years.
```{r}
plane_cohorts <- inner_join(flights,
select(planes, tailnum, plane_year = year),
by = "tailnum"
) %>%
mutate(age = year - plane_year) %>%
filter(!is.na(age)) %>%
mutate(age = if_else(age > 25, 25L, age)) %>%
group_by(age) %>%
summarise(
dep_delay_mean = mean(dep_delay, na.rm = TRUE),
dep_delay_sd = sd(dep_delay, na.rm = TRUE),
arr_delay_mean = mean(arr_delay, na.rm = TRUE),
arr_delay_sd = sd(arr_delay, na.rm = TRUE),
n_arr_delay = sum(!is.na(arr_delay)),
n_dep_delay = sum(!is.na(dep_delay))
)
```
I will look for a relationship between departure delay and age by plotting age against the average departure delay.
The average departure delay is increasing for planes with ages up until 10 years. After that the departure delay decreases or levels off.
The decrease in departure delay could be because older planes with many mechanical issues are removed from service or because air lines schedule these planes with enough time so that mechanical issues do not delay them.
```{r}
ggplot(plane_cohorts, aes(x = age, y = dep_delay_mean)) +
geom_point() +
scale_x_continuous("Age of plane (years)", breaks = seq(0, 30, by = 10)) +
scale_y_continuous("Mean Departure Delay (minutes)")
```
There is a similar relationship in arrival delays.
Delays increase with the age of the plane until ten years, then it declines and flattens out.
```{r}
ggplot(plane_cohorts, aes(x = age, y = arr_delay_mean)) +
geom_point() +
scale_x_continuous("Age of Plane (years)", breaks = seq(0, 30, by = 10)) +
scale_y_continuous("Mean Arrival Delay (minutes)")
```
</div>
### Exercise 13.4.4 {.unnumbered .exercise data-number="13.4.4"}
<div class="question">
What weather conditions make it more likely to see a delay?
</div>
<div class="answer">
Almost any amount of precipitation is associated with a delay.
However, there is not a strong a trend above 0.02 in. of precipitation.
```{r}
flight_weather <-
flights %>%
inner_join(weather, by = c(
"origin" = "origin",
"year" = "year",
"month" = "month",
"day" = "day",
"hour" = "hour"
))
```
```{r}
flight_weather %>%
group_by(precip) %>%
summarise(delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(aes(x = precip, y = delay)) +
geom_line() + geom_point()
```
There seems to be a stronger relationship between visibility and delay.
Delays are higher when visibility is less than 2 miles.
```{r}
flight_weather %>%
ungroup() %>%
mutate(visib_cat = cut_interval(visib, n = 10)) %>%
group_by(visib_cat) %>%
summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(aes(x = visib_cat, y = dep_delay)) +
geom_point()
```
</div>
### Exercise 13.4.5 {.unnumbered .exercise data-number="13.4.5"}
<div class="question">
What happened on June 13, 2013?
Display the spatial pattern of delays, and then use Google to cross-reference with the weather.
</div>
<div class="answer">
There was a large series of storms (derechos) in the southeastern US (see [June 12-13, 2013 derecho series](https://en.wikipedia.org/wiki/June_12%E2%80%9313,_2013_derecho_series)).
The following plot show that the largest delays were in Tennessee (Nashville), the Southeast, and the Midwest, which were the locations of the derechos.
```{r}
flights %>%
filter(year == 2013, month == 6, day == 13) %>%
group_by(dest) %>%
summarise(delay = mean(arr_delay, na.rm = TRUE)) %>%
inner_join(airports, by = c("dest" = "faa")) %>%
ggplot(aes(y = lat, x = lon, size = delay, colour = delay)) +
borders("state") +
geom_point() +
coord_quickmap() +
scale_colour_viridis()
```
</div>
## Filtering joins {#filtering-joins .r4ds-section}
### Exercise 13.5.1 {.unnumbered .exercise data-number="13.5.1"}
<div class="question">
What does it mean for a flight to have a missing `tailnum`?
What do the tail numbers that don’t have a matching record in planes have in common?
(Hint: one variable explains ~90% of the problems.)
</div>
<div class="answer">
Flights that have a missing `tailnum` all have missing values of `arr_time`, meaning that the flight was canceled.
```{r}
flights %>%
filter(is.na(tailnum), !is.na(arr_time)) %>%
nrow()
```
Many of the tail numbers that don't have a matching value in `planes` are
registered to American Airlines (AA) or Envoy Airlines (MQ).
The documentation for `planes` states
> American Airways (AA) and Envoy Air (MQ) report fleet numbers rather than tail numbers so can't be matched.
```{r}
flights %>%
anti_join(planes, by = "tailnum") %>%
count(carrier, sort = TRUE) %>%
mutate(p = n / sum(n))
```
However, not all tail numbers appearing in`flights` from these carriers are missing from the `planes` table. I don't know how to reconcile this discrepancy.
```{r}
flights %>%
distinct(carrier, tailnum) %>%
left_join(planes, by = "tailnum") %>%
group_by(carrier) %>%
summarise(total_planes = n(),
not_in_planes = sum(is.na(model))) %>%
mutate(missing_pct = not_in_planes / total_planes) %>%
arrange(desc(missing_pct))
```
</div>
### Exercise 13.5.2 {.unnumbered .exercise data-number="13.5.2"}
<div class="question">
Filter flights to only show flights with planes that have flown at least 100 flights.
</div>
<div class="answer">
First, I find all planes that have flown at least 100 flights.
I need to filter flights that are missing a tail number otherwise all flights missing a tail number will be treated as a single plane.
```{r}
planes_gte100 <- flights %>%
filter(!is.na(tailnum)) %>%
group_by(tailnum) %>%
count() %>%
filter(n >= 100)
```
Now, I will semi join the data frame of planes that have flown at least 100 flights to the data frame of flights to select the flights by those planes.
```{r}
flights %>%
semi_join(planes_gte100, by = "tailnum")
```
This can also be answered with a grouped mutate.
```{r}
flights %>%
filter(!is.na(tailnum)) %>%
group_by(tailnum) %>%
mutate(n = n()) %>%
filter(n >= 100)
```
</div>
### Exercise 13.5.3 {.unnumbered .exercise data-number="13.5.3"}
<div class="question">
Combine `fueleconomy::vehicles` and `fueleconomy::common` to find only the records for the most common models.
</div>
<div class="answer">
```{r}
fueleconomy::vehicles %>%
semi_join(fueleconomy::common, by = c("make", "model"))
```
Why does the above code join on `make` and `model` and not just `model`?
It is possible for two car brands (`make`) to produce a car with the same name (`model`).
In both the `vehicles` and `common` data we can find some examples.
For example, "Truck 4WD" is produced by many different brands.
```{r}
fueleconomy::vehicles %>%
distinct(model, make) %>%
group_by(model) %>%
filter(n() > 1) %>%
arrange(model)
```
```{r}
fueleconomy::common %>%
distinct(model, make) %>%
group_by(model) %>%
filter(n() > 1) %>%
arrange(model)
```
If we were to merge these data on the `model` column alone, there would be incorrect matches.
</div>
### Exercise 13.5.4 {.unnumbered .exercise data-number="13.5.4"}
<div class="question">
Find the 48 hours (over the course of the whole year) that have the worst delays.
Cross-reference it with the weather data.
Can you see any patterns?
</div>
<div class="answer">
I will start by clarifying how I will be measuring the concepts in the question.
There are three concepts that need to be defined more precisely.
1. What is meant by "delay"?
I will use departure delay.
Since the `weather` data only contains data for the New York City airports, and
departure delays will be more sensitive to New York City weather conditions than arrival delays.
1. What is meant by "worst"? I define worst delay as the average departure delay per flight for flights *scheduled* to depart in that hour.
For hour, I will use the scheduled departure time rather than the actual departure time.
If planes are delayed due to weather conditions, the weather conditions during the scheduled time are more important than the actual departure time, at which point, the weather could have improved.
1. What is meant by "48 hours over the course of the year"? This could mean two days, a span of 48 contiguous hours,
or 48 hours that are not necessarily contiguous hours.
I will find 48 not-necessarily contiguous hours.
That definition makes better use of the methods introduced in this section and chapter.
1. What is the unit of analysis? Although the question mentions only hours, I will use airport hours.
The weather dataset has an observation for each airport for each hour.
Since all the departure airports are in the vicinity of New York City, their weather should be similar, it will not be the same.
First, I need to find the 48 hours with the worst delays.
I group flights by hour of scheduled departure time and calculate the average delay.
Then I select the 48 observations (hours) with the highest average delay.
```{r}
worst_hours <- flights %>%
mutate(hour = sched_dep_time %/% 100) %>%
group_by(origin, year, month, day, hour) %>%
summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ungroup() %>%
arrange(desc(dep_delay)) %>%
slice(1:48)
```
Then I can use `semi_join()` to get the weather for these hours.
```{r}
weather_most_delayed <- semi_join(weather, worst_hours,
by = c("origin", "year",
"month", "day", "hour"))
```
For weather, I'll focus on precipitation, wind speed, and temperature.
I will display these in both a table and a plot.
Many of these observations have a higher than average wind speed (10 mph) or some precipitation.
However, I would have expected the weather for the hours with the worst delays to be much worse.
```{r}
select(weather_most_delayed, temp, wind_speed, precip) %>%
print(n = 48)
```
```{r}
ggplot(weather_most_delayed, aes(x = precip, y = wind_speed, color = temp)) +
geom_point()
```
It's hard to say much more than that without using the tools from [Exploratory Data Analysis](https://r4ds.had.co.nz/exploratory-data-analysis.html#covariation) section
to look for covariation between weather and flight delays using all flights.
Implicitly in my informal analysis of trends in weather using only the 48 hours with the worst delays, I was comparing the weather in these hours to some belief I had about what constitutes "normal" or "good" weather.
It would be better to actually use data to make that comparison.
</div>
### Exercise 13.5.5 {.unnumbered .exercise data-number="13.5.5"}
<div class="question">
What does `anti_join(flights, airports, by = c("dest" = "faa"))` tell you?
What does `anti_join(airports, flights, by = c("faa" = "dest"))` tell you?
</div>
<div class="answer">
The expression `anti_join(flights, airports, by = c("dest" = "faa"))` returns the flights that went to an airport that is not in the FAA list of destinations.
Since the FAA list only contains domestic airports, these are likely foreign flights.
However, running that expression that there are only four airports in this list.
```{r}
anti_join(flights, airports, by = c("dest" = "faa")) %>%
distinct(dest)
```
In this set of four airports three are in Puerto Rico ([BQN](https://en.wikipedia.org/wiki/Rafael_Hern%C3%A1ndez_Airport), [SJU](https://en.wikipedia.org/wiki/Luis_Mu%C3%B1oz_Mar%C3%ADn_International_Airport), and [PSE](https://en.wikipedia.org/wiki/Mercedita_International_Airport)) and one is in the US Virgin Islands ( [STT](https://en.wikipedia.org/wiki/Cyril_E._King_Airport)).
The reason for this discrepancy is that the `flights` and `airports` tables are derived from different sources.
The `flights` data comes from the US Department of Transportation [Bureau of Transportation Statistics](https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236), while the airport metadata comes from [openflights.org](https://openflights.org/data.html).
The BTS includes Puerto Rico and U.S. Virgin Islands as "domestic" (part of the US), while the [openflights.org](https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat) give use different values of country for airports in the US states (`"United States"`) Puerto Rico (`"Puerto Rico"`) and US Virgin Islands (`"Virgin Islands"`).
The expression `anti_join(airports, flights, by = c("faa" = "dest"))` returns the US airports that were not the destination of any flight in the data.
Since the data contains all flights from New York City airports, this is also the list of US airports that did not have a nonstop flight from New York City in 2013.
```{r}
anti_join(airports, flights, by = c("faa" = "dest"))
```
</div>
### Exercise 13.5.6 {.unnumbered .exercise data-number="13.5.6"}
<div class="question">
You might expect that there’s an implicit relationship between plane and airline, because each plane is flown by a single airline.
Confirm or reject this hypothesis using the tools you’ve learned above.
</div>
<div class="answer">
At each point in time, each plane is flown by a single airline.
However, a plane can be sold and fly for multiple airlines.
Logically, it is possible that a plane can fly for multiple airlines over the course of its lifetime.
But, it is not necessarily the case that a plane will fly for more than one airline in this data, especially since it comprises only a year of data.
So let's check to see if there are any planes in the data flew for multiple airlines.
First, find all distinct airline, plane combinations.
```{r planes_carriers}
planes_carriers <-
flights %>%
filter(!is.na(tailnum)) %>%
distinct(tailnum, carrier)
```
The number of planes that have flown for more than one airline are those `tailnum` that appear more than once in the `planes_carriers` data.
```{r}
planes_carriers %>%
count(tailnum) %>%
filter(n > 1) %>%
nrow()
```
The names of airlines are easier to understand than the two-letter carrier codes.
The `airlines` data frame contains the names of the airlines.
```{r carrier_transfer_tbl}
carrier_transfer_tbl <- planes_carriers %>%
# keep only planes which have flown for more than one airline
group_by(tailnum) %>%
filter(n() > 1) %>%
# join with airlines to get airline names
left_join(airlines, by = "carrier") %>%
arrange(tailnum, carrier)
carrier_transfer_tbl
```
</div>
## Join problems {#join-problems .r4ds-section}
`r no_exercises()`
## Set operations {#set-operations .r4ds-section}
`r no_exercises()`