-
Notifications
You must be signed in to change notification settings - Fork 0
/
01_api-funct-iter.Rmd
286 lines (173 loc) · 14.5 KB
/
01_api-funct-iter.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
---
title: "Lesson 3.1: API Calls, Functions, and Iterations"
author: "Katie Willi"
date: "`r Sys.Date()`"
output: html_document
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(echo=TRUE, eval=TRUE, message=FALSE, warning=FALSE, rows.print=5)
```
### Lesson Objectives
In this lesson we will download data using an application programming interface (API), create our own functions, and iterate using `for` loops and `map()`. To fulfill these objectives we will be utilizing two park visitation data sets from the National Park Service (NPS): NPS-wide visitation data, and park unit-specific visitation data.
There are **seven exercises** in this lesson that must be completed.
# APIs
An API is software that acts as an intermediary between an online data warehouse (or server) and its users (or clients). As data scientists, APIs provide us a way to request clean and nicely-formatted data that the server will then send to our local computers, all within our RStudio console! To work with APIs, we will need to use two new packages: `httr`, which allows us to communicate with the API's server, and `jsonlite`, which allows us to work with one of the most common API data formats, JSON. Let's go ahead and load in our packages for this lesson:
```{r}
library(tidyverse)
library(httr)
library(jsonlite)
```
## NPS Visitation Data
This week, we will be exploring NPS visitor use data across the NPS system as a whole, and and across specific park units. Like many institutions, NPS has a server that stores all of this information (as well as many other things), and an API for users to be able to access it.
To utilize the NPS API in R, we first need to explore its [API's data structure](https://irmaservices.nps.gov/). In almost every case, we use URLs to access specific data from APIs. To find the access URL for NPS visitation data, go to [Stats Rest API - Documentation](https://irmaservices.nps.gov/v3/rest/stats/help) (though not very intuitive, the NPS API calls its visitation data set "Stats"). Listed there you will see that all data associated with the "Stats" data set can be accessed using the base URL [**https://irmaservices.nps.gov/v3/rest/stats**](https://irmaservices.nps.gov/v3/rest/stats){.uri}. From there, you can tack on additional html text to access two different data sets: **total/{year}** and **visitation**.
For starters, let's try accessing the **total/{year}**. This data set gives us total monthly visitation across all NPS park units, for a user-selected year:
[**https://irmaservices.nps.gov/v3/rest/stats/total/{YEAR}**](https://irmaservices.nps.gov/v3/rest/stats/total/%7BYEAR%7D){.uri}
If you tried accessing that URL, you'll have noticed it doesn't take you anywhere. This is because the curly brackets {} signify locations in the URL that need to be updated by the user based on their specific needs. I'm curious about visitor use in my birth year, so let's tweak the URL to access visitation data from 1992. In R, we can access this data using `httr`'s `GET()` function, replacing {YEAR} with 1992.
```{r}
raw_data <- httr::GET(url = "https://irmaservices.nps.gov/v3/rest/stats/total/1992")
glimpse(raw_data)
```
Viewing the data set as-is, you can see it is not super human-readable. This is because data sent from APIs is typically packaged using JavaScript Object Notation (JSON).
To unpack the data, we will first need to use `httr`'s `content()` function. In this example, we want the data to be extracted as *text,* since this is a data table. Moreover, its encoding is listed as *UTF-8*. The encoding parameter can be found by opening our raw data set in our R console:
```{r}
raw_data # lists 'UTF-8'
# convert content to text
unpacked_data <- httr::content(raw_data, as = "text", encoding = "UTF-8")
```
Second, we need to transform this string of text, which is still in JSON formatting, into a data frame using `jsonlite`'s `fromJSON()`:
```{r}
# parse text from JSON to data frame
final_data <- jsonlite::fromJSON(unpacked_data)
final_data
```
Hooray, you have now successfully pulled in an online data set using an API! 😁
### Exercise #1 {style="color: maroon"}
**Using the code above as a starting point, pull in monthly NPS-wide visitation data for the years 1980, 1999, and 2018.**
```{r}
```
### Exercise #2 {style="color: maroon"}
**Now, let's explore the second NPS visitation data set, [visitation](https://irmaservices.nps.gov/v3/rest/stats/help/operations/FetchVisitation). This call pulls in monthly data for a specific park, across a specific time frame. Use your new API skills to pull in visitation data for Rocky Mountain National Park from 2010 through 2021, based on the API's URL template. The unit code for Rocky Mountain National Park is ROMO. (Hint: an API URL can have multiple sections that need to be updated by the user; this one requires the starting month and year, the ending month and year, and the park unit of interest.)**
```{r}
```
# Functions
You may find yourself thinking, *"Wow, exercise 1 was overkill!"* Indeed, you had to run several lines of code that were nearly identical to what was shown upstream; the only thing you needed to change from one year to the next was the year itself. This sort of redundant coding is not good coding practice. Instead of copying and pasting many coding steps over and over again and tweaking just a tiny portion of it, we can write *functions* that combine many coding steps into just one command. The benefits of reducing redundant code in this way are threefold. As Grolemund & Wickham describe in their book, [*R for Data Science*](https://r4ds.had.co.nz/):
> 1. It's easier to see the intent of your code, because your eyes are drawn to what's different, not what stays the same.
> 2. It's easier to respond to changes in requirements. As your needs change, you only need to make changes in one place, rather than remembering to change every place that you copied-and-pasted the code.
> 3. You're likely to have fewer bugs because each line of code is used in more places.
*Functions* provide the option of changing just a minor part of the code base from one run to the next. Think of the `GET()` function in `httr`: it is a function that has code under-the-hood so that it isn't necessary to write out the raw code each time we use it. Instead, we call out the function's name (`GET()`), and the necessary argument within that function that tweaks the code to fit it to our needs (`url = "<SOME_URL_WE_CHOOSE>"`).
## Functionize API Pulls
Let's try making a function called `parkwide_visitation()` that pulls in NPS-wide visitation data for a year of choice. To develop a function requires specific formatting:
```{r, eval = F}
<NAME> <- function(<ARGUMENTS>){
<ACTIONS>
return(<OUTPUT>)
}
```
... where NAME is what we want to name the function; ARGUMENTS are the variables in the code that get "tweaked"; ACTIONS are the lines of code we want the function to perform (which includes our ARGUMENTS); and the OUTPUT is the object we want as the final outcome of running the function.
For `parkwide_visitation()`, we will use our upstream code as the basis for our function, but with a few minor yet extremely important tweaks:
```{r}
parkwide_visitation <- function(year){
# pull in the data
raw_data <- httr::GET(url =
# parse out year so that it can be chosen with the "year" argument, using paste0()
paste0("https://irmaservices.nps.gov/v3/rest/stats/total/", year))
# convert content to text
extracted_data <- httr::content(raw_data, as = "text", encoding = "UTF-8")
# parse text from JSON to data frame
final_data <- jsonlite::fromJSON(extracted_data)
return(final_data)
}
```
In the above function, our first object, `raw_data`, now changes based on how we define our year argument. We accomplish this through `paste0()`, which takes listed objects, transforms them into characters (if they aren't already), and concatenates them into a single character string. For example:
```{r}
my_sentence <- "I need at least"
my_other_sentence <- "pints of ice cream a day"
paste0(my_sentence, " ", 08, " ", my_other_sentence, "!")
```
So, if we make `year = 2021` in our `parkwide_visitation()` function, the year object becomes the number 2021, which makes the `paste0()` output "<https://irmaservices.nps.gov/v3/rest/stats/total/2021>", which subsequently pulls data for 2021. In other words, we can now pull visitation data for any year with just one line of code!
```{r}
pull_2018 <- parkwide_visitation(year = 2018)
pull_1980 <- parkwide_visitation(year = 1980)
pull_1992 <- parkwide_visitation(year = 1992)
# ... and so on!
```
### Exercise #3 {style="color: maroon"}
**Create a function called `unit_visitation()` that pulls park-specific visitation data for any park, across any time frame. For a list of all park codes, download [this spreadsheet](https://www.nps.gov/aboutus/foia/upload/NPS-Unit-List.xlsx). (Hint 1: functions can have multiple arguments. For this step, you will want arguments representing the start and end month and year, and park unit). Hint 2: Exercise 2 should be used as a starting point for making this function.)**
```{r}
```
### Exercise #4 {style="color: maroon"}
**Using `unit_visitation()`, pull in visitation data for Rocky Mountain National Park (ROMO), Everglades National Park (EVER), and Theodore Roosevelt National Park (THRO) from November 1992 through December 2021.**
```{r}
```
## Function Defaults
Look at the code that you just wrote; writing out all of those unchanging date arguments still feels repetitive, right? When developing functions, there is an option for setting default values for arguments so that you don't necessarily have to write all of them out every time you run it in the future. But, the option still exists within the function to make changes when necessary. For example, let's tweak our `parkwide_visitaion()` function to have the default year be 2021:
```{r}
parkwide_visitation <- function(year = "2021") {
raw_data <- httr::GET(url = paste0("https://irmaservices.nps.gov/v3/rest/stats/total/", year))
# convert content to text
extracted_data <- httr::content(raw_data, as = "text", encoding = "UTF-8")
# parse text from JSON to data frame
final_data <- jsonlite::fromJSON(extracted_data)
return(final_data)
}
parkwide_visitation()
```
Because the default year is 2021, you don't have to write it out explicitly in the function (so long as that's the year you're interested in). But, you still have the option of changing the year to something else:
```{r}
parkwide_visitation(year = "1992")
```
### Exercise #5 {style="color: maroon"}
**For our `unit_visitation()` function, make the default arguments for the start and end months January and December, respectively. This way, we are automatically pulling in data for an entire year. Then, rerun the updated 'unit_visitation()' function for ROMO, EVER, and THRO for the 1980-2021 time period to make sure it works properly.**
```{r}
```
# Iterations
At this point, we now know how to develop functions so that we do not have to keep writing out redundant steps in a workflow. However, in that last exercise, you can see that we are *still* writing out redundant code; we are performing the exact same function on each of our three park units.
Another tool for reducing redundancy is **iteration**, which allows you to do the same thing on multiple inputs. Iteration can happen across different objects, different rows, different data frames, the list goes on and on!
## For loops
A `for` loop is base R's iteration tool that executes code across a vector, an array, a list, etc. To save the outcome of each iteration, you must first create a vector to store the outputs in that is sized based on how many objects you want to iterate over. For example, I want to run our `parkwide_visitation()` function over the last five years: 2017, 2018, 2019, 2020, and 2021. To do that, I will first need to develop a vector listing each year:
```{r}
years <- c('2017', '2018', '2019', '2020', '2021')
```
... and then develop an empty list to store each year's `parkwide_visitation()` results (i.e., output) into:
```{r}
output_floop <- vector("list", length = length(years))
```
Now that we have a place to store each year's function results, we can move forward with the for loop itself:
```{r}
for(i in 1:length(years)){
output_floop[[i]] <-
parkwide_visitation(year = years[i])
}
```
... where `years[i]` tells the `for` loop to perform `parkwide_visitation()` on the *i^th^* year (think of *i* as a moving across each year), and `output_floop[[i]]` directs the `for` loop to store the results of the *i^th^* year's run into `output`'s *i^th^* list (think of `output_floop[[i]]` as the location in `output_floop` that the *i^th^*'s results go).
We now have a list containing five data frames: one for each year of visitation data:
```{r}
summary(output_floop)
```
Because each year's output is structured identically, we can confidently combine each year's data frame into a single data frame using `dplyr::bind_rows()`:
```{r}
multi_years <- dplyr::bind_rows(output_floop)
```
### Exercise #6 {style="color: maroon"}
**Use a for loop to run `unit_visitation()` with arguments `start_year = 1980` and `end_year = 2021` across ROMO, EVER, and THRO. Then, create a single data frame containing each park units' output. (Hint: Your first step will be to create a vector listing each park unit.)**
```{r}
```
## Mapping
The `tidyverse`'s `purrr` package has its own iteration function, `map()`, that is a variation of the `for` loop. `map()` takes a vector and applies a single function across it, then automatically stores all of the results into a list. In other words, `map()` creates an appropriately sized list to store our results in for us. This eliminates the need to create an empty list ahead of time.
To create the same output as our previous `for` loop on `parkwide_visitation()`, but using `map()` instead, we would run the following code:
```{r}
output_map <- years %>%
map(~ parkwide_visitation(year = .))
```
... where `~` indicates that we want to perform `parkwide_visitation()` across all years, and `.` indicates that we want to use our piped vector, `years`, as the input to the `year` argument. As you can see, `output_map` is identical to `output_floop`:
```{r}
identical(output_floop, output_map)
```
... which means we should also `bind_rows()` to get the mapped output into a single data frame:
```{r}
multi_years <- bind_rows(output_map)
```
### Exercise #7 {style="color: maroon"}
**Use `map()` to run `unit_visitation()` with arguments `start_year = 1980` and `end_year = 2021` across ROMO, EVER, and THRO. Then, create a single data frame containing each park units' output.**
```{r}
```