-
Notifications
You must be signed in to change notification settings - Fork 0
/
MHendry_02_data-wrangling.Rmd
316 lines (229 loc) · 11.5 KB
/
MHendry_02_data-wrangling.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
---
title: "3.2: Data Wrangling and Visualization"
author: "Katie Willi"
date: "`r Sys.Date()`"
output: html_document
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(echo=TRUE, eval=TRUE, message=FALSE, warning=FALSE, rows.print=5, fig.width=11)
```
### Lesson Objectives
In the last lesson, we learned how to pull data from an API and reduce redundancies in our workflows through functions and iteration. In this lesson we will use the functions in the previous lesson to learn how to manipulate data frames with the {tidyverse}, and plot elegant time series graphs with the {ggplot2}, {scales} and {plotly} packages.
There are **five exercises** in this lesson that must be completed.
## Pulling in necessary packages and data sets
```{r}
library(tidyverse) # ggplot2 is included in the {tidyverse}
library(httr)
library(jsonlite)
library(plotly)
library(scales)
```
Using the `parkwide_visitation()` function from the last lesson and mapping, let's pull park-wide visitor data from 1980-2021, and name the final object `parkwide`. (Code hack: we can use `1980:2021` to create a vector of years so we don't have to write each year out!)
```{r}
parkwide_visitation <- function(year){
# pull in the data
raw_data <- httr::GET(url =
# parse out year so that it can be chosen with the "year" argument, using paste0()
paste0("https://irmaservices.nps.gov/v3/rest/stats/total/", year))
# convert content to text
extracted_data <- httr::content(raw_data, as = "text", encoding = "UTF-8")
# parse text from JSON to data frame
final_data <- jsonlite::fromJSON(extracted_data)
return(final_data)
}
years <- (1980:2021)
parkwide <- years %>%
map(~ parkwide_visitation(year = .x)) %>%
bind_rows()
```
### Exercise #1 {style="color: maroon"}
**Using the `unit_visitation()` function from the last lesson and mapping, pull visitor data from 1980-2021 for the following park units: ROMO, ACAD, LAKE, YELL, GRCA, ZION, OLYM, and GRSM. Name the final output `units`.**
```{r}
unit_visitation <- function(unit, start_month = "1", start_year, end_month = "12", end_year) {
url <- httr::GET(url = paste0("https://irmaservices.nps.gov/v3/rest/stats/visitation?unitCodes=", unit,
"&startMonth=", start_month,
"&startYear=", start_year,
"&endMonth=", end_month,
"&endYear=", end_year))
data_messy <- httr::content(url, as = "text", encoding = "UTF-8")
data_clean <- jsonlite::fromJSON(data_messy)
return(data_clean)
}
parks <- c("ROMO", "ACAD", "LAKE", "YELL", "GRCA", "ZION", "OLYM", "GRSM")
units <- parks %>%
map(~ unit_visitation(unit = .x, start_year = "1980", end_year = "2021")) %>%
bind_rows()
```
## Exploring our data
Look at the data frame structure of `parkwide` and `units`; they're exactly the same! So let's go ahead and bind those together:
```{r}
visitation <- bind_rows(parkwide, units)
```
... except, the rows in `parkwide`'s UnitCode and UnitCode columns are empty. 😑 Let's fix the `UnitCode` column to list "Parkwide" using `mutate()` and an `ifelse()` statement:
```{r}
visitation <- visitation %>% mutate(UnitCode = if_else(is.na(UnitCode), "Parkwide", UnitCode))
```
Think of the above `if_else()` operation as: "If the column `UnitCode` is `NA`, replace `NA` with `Parkwide`. Otherwise, preserve what is already in the `UnitCode` column."
Now that we have a single data set containing all of the NPS visitation data that we've pulled, let's start exploring it! But first, let's aggregate the monthly data into annual data using `group_by()` and `summarize()`:
```{r}
annual_visitation <- visitation %>%
group_by(UnitCode, Year) %>%
# we only care about recreational visitors:
summarize(RecVisitation = sum(RecreationVisitors))
annual_visitation
```
What does visitation data look like through time? First we can try to graph all of the park units together:
```{r}
ggplot(data=annual_visitation)+
geom_point(aes(x = Year, y = RecVisitation, color = UnitCode)) +
geom_path(aes(x = Year, y = RecVisitation, color = UnitCode)) +
scale_y_continuous(labels = scales::label_scientific()) +
theme_bw(base_size=10)
```
... yikes, not surprisingly, parkwide visitation is wayyyy higher than our individual unit's visitation data, making our graph pretty useless. It might be nice to have each park unit in a graph of its own.
We can create individual graphs for each unit using `facet_wrap()`, and set the y-axes for each plot to `"free_y"`:
```{r}
ggplot(data=annual_visitation) +
geom_point(aes(x = Year, y = RecVisitation, color = UnitCode)) +
geom_path(aes(x = Year, y = RecVisitation, color = UnitCode)) +
scale_y_continuous(labels = scales::label_scientific()) +
facet_wrap(~UnitCode, scales = "free_y") +
theme_bw(base_size=10)
```
We can also make this plot interactive by feeding it into `plotly`'s `ggplotly()` function:
```{r}
plotly::ggplotly(
ggplot(data=annual_visitation) +
geom_point(aes(x = Year, y = RecVisitation, color = UnitCode)) +
geom_path(aes(x = Year, y = RecVisitation, color = UnitCode)) +
scale_y_continuous(labels = scales::label_scientific()) +
facet_wrap(~UnitCode, scales = "free_y") +
theme_bw(base_size=10)
)
```
### Exercise #2 {style="color: maroon"}
**Create an interactive graph with two separate panes: one showing park-wide visitation, the other showing all the individual park units together. Both panes should have different y-axes.**
```{r}
# facet, create a new variable for the pane
indiv_parks <- annual_visitation %>%
filter(UnitCode == c("ACAD", "GRCA", "GRSM", "LAKE", "OLYM", "ROMO", "YELL", "ZION"))
just_parkwide <- annual_visitation %>%
filter(UnitCode == "Parkwide")
plot_parkwide <- plotly::ggplotly(ggplot(data=indiv_parks)+
geom_point(aes(x = Year, y = RecVisitation, color = UnitCode)) +
geom_path(aes(x = Year, y = RecVisitation, color = UnitCode)) +
scale_y_continuous(labels = scales::label_scientific()) +
theme_bw(base_size=10))
plot_parks <- plotly::ggplotly(ggplot(data=just_parkwide)+
geom_point(aes(x = Year, y = RecVisitation, color = UnitCode)) +
geom_path(aes(x = Year, y = RecVisitation, color = UnitCode)) +
scale_y_continuous(labels = scales::label_scientific()) +
theme_bw(base_size=10))
library(ggpubr)
ggarrange(plot_parkwide, plot_parks, rremove("x.text"), labels = c("Parkwide", "Individual Parks"), ncol = 2, nrow = 1)
```
It is pretty clear that some park units get orders of magnitude more visitors than others. But just how much of the total park visitation do each of these parks account for from year to year? Here we walk through two methods to tackle this question, ***pivoting*** and ***joining***, to get park unit visitation side-by-side with park-wide data.
## Pivoting
Currently, our annual visitation data is considered *long* because we have all of our NPS visitation data in one column, with multiple rows representing the same year. We can make this data *wide* by using the function `pivot_wider()`
```{r}
wide_data <- annual_visitation %>%
select(Year, UnitCode, RecVisitation) %>%
pivot_wider(., names_from = UnitCode, values_from = RecVisitation)
```
... where `names_from` represents the column with the values you are hoping to spread into new columns, and `values_from` represents the data you want to fill these new columns with.
We can make the data set *long* again by using the function `pivot_longer()`:
```{r}
long_data <- wide_data %>%
pivot_longer(cols = -Year,
names_to = "Park",
values_to = "RecVisitation")
```
... where `cols` are the columns we want to gather into one column (or, the column(s) you DON'T want to gather), while `names_to` and `values_to` are the names and values for the new columns produced from the pivot.
### Exercise #3 {style="color: maroon"}
**Using `wide_data` as the starting point, create an interactive time series plot showing the annual percentage of the total visitation made up by all park units. In other words, a visual that allows us to see how much each park unit contributes to the total park visitation across the NPS system.**
```{r}
plotly::ggplotly(
ggplot(data=wide_data) +
geom_point(aes(x = Year, y = ACAD/Parkwide)) +
geom_path(aes(x = Year, y = ACAD/Parkwide)) +
scale_y_continuous(labels = scales::label_scientific()) +
theme_bw(base_size=10)
)
plotly::ggplotly(
ggplot(data=wide_data) +
geom_point(aes(x = Year, y = GRCA/Parkwide)) +
geom_path(aes(x = Year, y = GRCA/Parkwide)) +
scale_y_continuous(labels = scales::label_scientific()) +
theme_bw(base_size=10)
)
plotly::ggplotly(
ggplot(data=wide_data) +
geom_point(aes(x = Year, y = GRSM/Parkwide)) +
geom_path(aes(x = Year, y = GRSM/Parkwide)) +
scale_y_continuous(labels = scales::label_scientific()) +
theme_bw(base_size=10)
)
plotly::ggplotly(
ggplot(data=wide_data) +
geom_point(aes(x = Year, y = LAKE/Parkwide)) +
geom_path(aes(x = Year, y = LAKE/Parkwide)) +
scale_y_continuous(labels = scales::label_scientific()) +
theme_bw(base_size=10)
)
plotly::ggplotly(
ggplot(data=wide_data) +
geom_point(aes(x = Year, y = OLYM/Parkwide)) +
geom_path(aes(x = Year, y = OLYM/Parkwide)) +
scale_y_continuous(labels = scales::label_scientific()) +
theme_bw(base_size=10)
)
plotly::ggplotly(
ggplot(data=wide_data) +
geom_point(aes(x = Year, y = ROMO/Parkwide)) +
geom_path(aes(x = Year, y = ROMO/Parkwide)) +
scale_y_continuous(labels = scales::label_scientific()) +
theme_bw(base_size=10)
)
plotly::ggplotly(
ggplot(data=wide_data) +
geom_point(aes(x = Year, y = YELL/Parkwide)) +
geom_path(aes(x = Year, y = YELL/Parkwide)) +
scale_y_continuous(labels = scales::label_scientific()) +
theme_bw(base_size=10)
)
plotly::ggplotly(
ggplot(data=wide_data) +
geom_point(aes(x = Year, y = ZION/Parkwide)) +
geom_path(aes(x = Year, y = ZION/Parkwide)) +
scale_y_continuous(labels = scales::label_scientific()) +
theme_bw(base_size=10)
)
```
## Joining
Another way of getting park-wide visitation side-by-side with the park unit data is through the use of joining our original `units` and `parkwide` data sets:
```{r}
joined_data <- inner_join(x = units, y = parkwide, by = c("Year","Month"))
```
... where `x` and `y` are the two data sets you want joined, and `by` indicates the column(s) to match them by. Note: there are several ways of joining data. Explore them with `` ?`mutate-joins` `` and `` ?`filter-joins` ``.
### Exercise #4 {style="color: maroon"}
**Using `joined_data` as the starting point, create an interactive time series plot showing the annual percentage of the total visitation made up by all park units. This plot should look nearly identical to the previous plot.**
```{r}
joined_data <- joined_data %>%
mutate(percentage = (RecreationVisitors.x / sum(RecreationVisitors.x)) * 100)
plotly::ggplotly(
ggplot(data=joined_data) +
geom_point(aes(x = Year, y = percentage, color = UnitCode.x)) +
geom_path(aes(x = Year, y = percentage, color = UnitCode.x) +
scale_y_continuous(labels = scales::label_scientific()) +
theme_bw(base_size=10)
))
```
### Exercise #5 {style="color: maroon"}
**Which park on average has the most visitation? Which park has the least visitation? Base your response on the data starting in 1990, ending in 2021. Defend your answer with numbers!**
```{r}
# could not get above plot to run, so :
avg_visitiation <- joined_data %>%
group_by(UnitCode.x) %>%
summarise(Total_percentage = mean(percentage))
```
# GRSM has the most visitation on average, while ZION has the least.