MHendry_02_data-wrangling.Rmd

---
title: "3.2: Data Wrangling and Visualization"
author: "Katie Willi"
date: "`r Sys.Date()`"
output: html_document
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo=TRUE, eval=TRUE, message=FALSE, warning=FALSE, rows.print=5, fig.width=11)
```

### Lesson Objectives

In the last lesson, we learned how to pull data from an API and reduce redundancies in our workflows through functions and iteration. In this lesson we will use the functions from the previous lesson to learn how to manipulate data frames with the {tidyverse}, and plot elegant time series graphs with the {ggplot2}, {scales}, and {plotly} packages.

There are **five exercises** in this lesson that must be completed.

## Pulling in necessary packages and data sets

```{r}
library(tidyverse) # ggplot2 is included in the {tidyverse}
library(httr)
library(jsonlite)
library(plotly)
library(scales)

```

Using the `parkwide_visitation()` function from the last lesson and mapping, let's pull park-wide visitor data from 1980-2021, and name the final object `parkwide`. (Code hack: we can use `1980:2021` to create a vector of years so we don't have to write each year out!)

```{r}
parkwide_visitation <- function(year){

  # pull in the data; paste0() appends the chosen "year" argument to the URL
  raw_data <- httr::GET(url = paste0("https://irmaservices.nps.gov/v3/rest/stats/total/", year))

  # convert content to text
  extracted_data <- httr::content(raw_data, as = "text", encoding = "UTF-8")

  # parse text from JSON to data frame
  final_data <- jsonlite::fromJSON(extracted_data)

  return(final_data)

}

years <- (1980:2021)

parkwide <- years %>% 
  map(~ parkwide_visitation(year = .x)) %>% 
  bind_rows()
```
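
To see what `map()` followed by `bind_rows()` is doing without calling the API, here is a minimal sketch. `fake_visitation()` is a hypothetical stand-in for `parkwide_visitation()`, and the numbers it returns are made up purely for illustration:

```{r}
library(dplyr)
library(purrr)

# hypothetical stand-in for parkwide_visitation(): returns a one-row data frame
fake_visitation <- function(year) {
  tibble(Year = year, RecreationVisitors = year * 10)
}

toy_years <- 2019:2021

# map() calls the function once per year and returns a list of data frames;
# bind_rows() then stacks that list into a single data frame
toy_parkwide <- toy_years %>%
  map(~ fake_visitation(year = .x)) %>%
  bind_rows()

toy_parkwide
```

The result has one row per element of `toy_years`, which is the same shape of workflow used for `parkwide` above.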

### Exercise #1 {style="color: maroon"}

**Using the `unit_visitation()` function from the last lesson and mapping, pull visitor data from 1980-2021 for the following park units: ROMO, ACAD, LAKE, YELL, GRCA, ZION, OLYM, and GRSM. Name the final output `units`.**

```{r}
unit_visitation <- function(unit, start_month = "1", start_year, end_month = "12", end_year) {

  # pull in the data for the chosen unit and date range
  raw_data <- httr::GET(url = paste0("https://irmaservices.nps.gov/v3/rest/stats/visitation?unitCodes=", unit,
                                     "&startMonth=", start_month,
                                     "&startYear=", start_year,
                                     "&endMonth=", end_month,
                                     "&endYear=", end_year))

  # convert content to text
  data_messy <- httr::content(raw_data, as = "text", encoding = "UTF-8")

  # parse text from JSON to data frame
  data_clean <- jsonlite::fromJSON(data_messy)

  return(data_clean)
}

parks <- c("ROMO", "ACAD", "LAKE", "YELL", "GRCA", "ZION", "OLYM", "GRSM")

units <- parks %>%
  map(~ unit_visitation(unit = .x, start_year = "1980", end_year = "2021")) %>%
  bind_rows()

```

## Exploring our data

Look at the data frame structure of `parkwide` and `units`; they're exactly the same! So let's go ahead and bind those together:

```{r}
visitation <- bind_rows(parkwide, units)
```

... except, the rows in `parkwide`'s `UnitCode` and `UnitName` columns are empty. 😑 Let's fix the `UnitCode` column to list "Parkwide" using `mutate()` and an `if_else()` statement:

```{r}
visitation <- visitation %>% mutate(UnitCode = if_else(is.na(UnitCode), "Parkwide", UnitCode))
```

Think of the above `if_else()` operation as: "If the column `UnitCode` is `NA`, replace `NA` with `Parkwide`. Otherwise, preserve what is already in the `UnitCode` column."
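
As a small illustration of the same idea, here is a sketch on a made-up three-row data frame (`toy_codes` is hypothetical; the `NA` plays the role of an empty park-wide row):

```{r}
library(dplyr)

# made-up unit codes; NA stands in for an empty park-wide row
toy_codes <- tibble(UnitCode = c("ROMO", NA, "YELL"))

# replace NA with "Parkwide", keep everything else as it is
toy_fixed <- toy_codes %>%
  mutate(UnitCode = if_else(is.na(UnitCode), "Parkwide", UnitCode))

toy_fixed
```

Unlike base R's `ifelse()`, `if_else()` expects the "true" and "false" values to be the same type, which is why both outcomes here are character.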

Now that we have a single data set containing all of the NPS visitation data that we've pulled, let's start exploring it! But first, let's aggregate the monthly data into annual data using `group_by()` and `summarize()`:

```{r}
annual_visitation <- visitation %>%
  group_by(UnitCode, Year) %>% 
  # we only care about recreational visitors:
  summarize(RecVisitation = sum(RecreationVisitors))

annual_visitation
```
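
If the grouping step feels abstract, this minimal sketch shows `group_by()` and `summarize()` on a made-up data frame. The unit codes "AAA" and "BBB" and all counts are hypothetical:

```{r}
library(dplyr)

# made-up monthly rows: two units, with more than one row per unit-year
toy_monthly <- tibble(
  UnitCode = c("AAA", "AAA", "AAA", "BBB", "BBB"),
  Year = c(2020, 2020, 2021, 2020, 2020),
  RecreationVisitors = c(10, 20, 5, 1, 2)
)

# one output row per UnitCode-Year combination, holding the summed visitors
toy_annual <- toy_monthly %>%
  group_by(UnitCode, Year) %>%
  summarize(RecVisitation = sum(RecreationVisitors), .groups = "drop")

toy_annual
```

Setting `.groups = "drop"` returns an ungrouped data frame. Without it, `summarize()` drops only the last grouping variable, so the result would still be grouped by `UnitCode`.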

What does visitation data look like through time? First we can try to graph all of the park units together:

```{r}
ggplot(data=annual_visitation)+
  geom_point(aes(x = Year, y = RecVisitation, color = UnitCode)) +
  geom_path(aes(x = Year, y = RecVisitation, color = UnitCode)) +
  scale_y_continuous(labels = scales::label_scientific()) +
  theme_bw(base_size=10)
```

... yikes, not surprisingly, parkwide visitation is wayyyy higher than our individual units' visitation data, making our graph pretty useless. It might be nice to have each park unit in a graph of its own.

We can create individual graphs for each unit using `facet_wrap()`, and let each plot have its own y-axis by setting `scales = "free_y"`:

```{r}
ggplot(data=annual_visitation) +
  geom_point(aes(x = Year, y = RecVisitation, color = UnitCode)) +
  geom_path(aes(x = Year, y = RecVisitation, color = UnitCode)) +
  scale_y_continuous(labels = scales::label_scientific()) +
  facet_wrap(~UnitCode, scales = "free_y") +
  theme_bw(base_size=10)
```
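
To isolate what the `scales` argument does, here is a small sketch with made-up data for two hypothetical units whose values differ by two orders of magnitude:

```{r}
library(ggplot2)

# made-up series: "Big" is roughly 100 times larger than "Small"
toy_series <- data.frame(
  Year = rep(2019:2021, times = 2),
  Visits = c(100, 120, 110, 1, 3, 2),
  UnitCode = rep(c("Big", "Small"), each = 3)
)

# scales = "free_y" lets each panel pick its own y-axis range;
# the default, scales = "fixed", would flatten the "Small" panel
toy_plot <- ggplot(toy_series, aes(x = Year, y = Visits)) +
  geom_path() +
  facet_wrap(~UnitCode, scales = "free_y")

toy_plot
```

The other accepted values are `"free_x"` and `"free"` (both axes). Free axes make each panel readable, but the panels can no longer be compared by eye on the same scale.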

We can also make this plot interactive by feeding it into `plotly`'s `ggplotly()` function:

```{r}
plotly::ggplotly(
  ggplot(data=annual_visitation) +
    geom_point(aes(x = Year, y = RecVisitation, color = UnitCode)) +
    geom_path(aes(x = Year, y = RecVisitation, color = UnitCode)) +
    scale_y_continuous(labels = scales::label_scientific()) +
    facet_wrap(~UnitCode, scales = "free_y") +
    theme_bw(base_size=10)
)
```

### Exercise #2 {style="color: maroon"}

**Create an interactive graph with two separate panes: one showing park-wide visitation, the other showing all the individual park units together. Both panes should have different y-axes.**

```{r}
# create a new variable for the pane, then facet on it
pane_data <- annual_visitation %>%
  ungroup() %>%
  mutate(Pane = if_else(UnitCode == "Parkwide", "Parkwide", "Individual Parks"))

plotly::ggplotly(
  ggplot(data = pane_data) +
    geom_point(aes(x = Year, y = RecVisitation, color = UnitCode)) +
    geom_path(aes(x = Year, y = RecVisitation, color = UnitCode)) +
    scale_y_continuous(labels = scales::label_scientific()) +
    # "free_y" gives each pane its own y-axis
    facet_wrap(~Pane, scales = "free_y") +
    theme_bw(base_size = 10)
)
```

It is pretty clear that some park units get orders of magnitude more visitors than others. But just how much of the total park visitation does each of these parks account for from year to year? Here we walk through two methods to tackle this question, ***pivoting*** and ***joining***, to get park unit visitation side-by-side with park-wide data.

## Pivoting

Currently, our annual visitation data is considered *long* because we have all of our NPS visitation data in one column, with multiple rows representing the same year. We can make this data *wide* by using the function `pivot_wider()`:

```{r}
wide_data <- annual_visitation %>%
  select(Year, UnitCode, RecVisitation) %>%
  pivot_wider(., names_from = UnitCode, values_from = RecVisitation)
```

... where `names_from` represents the column with the values you are hoping to spread into new columns, and `values_from` represents the data you want to fill these new columns with.
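
Here is a minimal sketch of that reshaping on a made-up long table (the unit code "AAA" and all values are hypothetical), including why the wide form is handy for comparing a unit against the park-wide total:

```{r}
library(dplyr)
library(tidyr)

# made-up long data: one row per Year-UnitCode combination
toy_long <- tibble(
  Year = c(2020, 2020, 2021, 2021),
  UnitCode = c("AAA", "Parkwide", "AAA", "Parkwide"),
  RecVisitation = c(10, 100, 30, 120)
)

# each distinct UnitCode becomes its own column, filled with RecVisitation
toy_wide <- toy_long %>%
  pivot_wider(names_from = UnitCode, values_from = RecVisitation)

# with the columns side by side, a share is a simple column calculation
toy_share <- toy_wide %>%
  mutate(AAA_percent = AAA / Parkwide * 100)

toy_share
```

The four long rows collapse to two wide rows, one per year.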

We can make the data set *long* again by using the function `pivot_longer()`:

```{r}
long_data <- wide_data %>%
  pivot_longer(cols = -Year,
               names_to = "Park",
               values_to = "RecVisitation")
```

... where `cols` are the columns we want to gather into one column (or, with a leading `-`, the column(s) we DON'T want to gather), while `names_to` and `values_to` are the names of the two new columns produced by the pivot: one holding the old column names, the other holding their values.
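
The two ways of specifying `cols` can be compared on a small made-up table (the columns "AAA" and "BBB" are hypothetical units):

```{r}
library(dplyr)
library(tidyr)

# made-up wide data: one column per unit
toy_wide_table <- tibble(Year = c(2020, 2021), AAA = c(10, 30), BBB = c(1, 2))

# gather everything except Year
toy_long_again <- toy_wide_table %>%
  pivot_longer(cols = -Year, names_to = "Park", values_to = "RecVisitation")

# the same result, naming the columns to gather explicitly
toy_long_explicit <- toy_wide_table %>%
  pivot_longer(cols = c(AAA, BBB), names_to = "Park", values_to = "RecVisitation")

toy_long_again
```

Excluding with `-Year` is usually more convenient when there are many columns to gather and only one or two to leave alone.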

### Exercise #3 {style="color: maroon"}

**Using `wide_data` as the starting point, create an interactive time series plot showing the annual percentage of the total visitation made up by all park units. In other words, a visual that allows us to see how much each park unit contributes to the total park visitation across the NPS system.**

```{r}
percent_data <- wide_data %>%
  # divide every unit column by the park-wide total to get a percentage
  mutate(across(-c(Year, Parkwide), ~ .x / Parkwide * 100)) %>%
  select(-Parkwide) %>%
  # back to long format so all units can share one plot
  pivot_longer(cols = -Year, names_to = "UnitCode", values_to = "Percent")

plotly::ggplotly(
  ggplot(data = percent_data) +
    geom_point(aes(x = Year, y = Percent, color = UnitCode)) +
    geom_path(aes(x = Year, y = Percent, color = UnitCode)) +
    labs(y = "Percent of park-wide recreation visitation") +
    theme_bw(base_size = 10)
)
```

## Joining

Another way of getting park-wide visitation side-by-side with the park unit data is through the use of joining our original `units` and `parkwide` data sets:

```{r}
joined_data <- inner_join(x = units, y = parkwide, by = c("Year","Month"))
```

... where `x` and `y` are the two data sets you want joined, and `by` indicates the column(s) to match them by. Note: there are several ways of joining data. Explore them with `` ?`mutate-joins` `` and `` ?`filter-joins` ``.
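
As a quick comparison of a few join types, here is a sketch on two made-up tables (all names and counts are hypothetical). Both tables share a `RecreationVisitors` column that is not used as a key, which is where the `.x` and `.y` suffixes come from:

```{r}
library(dplyr)

# made-up monthly counts for one unit
toy_units <- tibble(
  UnitCode = c("AAA", "AAA", "AAA"),
  Year = c(2020, 2020, 2021),
  Month = c(1, 2, 1),
  RecreationVisitors = c(5, 6, 7)
)

# made-up totals; note there is no row for 2021
toy_total <- tibble(
  Year = c(2020, 2020),
  Month = c(1, 2),
  RecreationVisitors = c(50, 60)
)

# inner_join() keeps only Year/Month combinations found in both tables;
# the shared non-key column is suffixed .x (from x) and .y (from y)
toy_inner <- inner_join(x = toy_units, y = toy_total, by = c("Year", "Month"))

# left_join() keeps every row of x and fills unmatched y columns with NA
toy_left <- left_join(x = toy_units, y = toy_total, by = c("Year", "Month"))

# anti_join() is a filtering join: rows of x that have no match in y
toy_anti <- anti_join(x = toy_units, y = toy_total, by = c("Year", "Month"))

toy_inner
```

Comparing `nrow()` across the three results is a simple way to check whether a join silently dropped rows.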

### Exercise #4 {style="color: maroon"}

**Using `joined_data` as the starting point, create an interactive time series plot showing the annual percentage of the total visitation made up by all park units. This plot should look nearly identical to the previous plot.**

```{r}

joined_data <- joined_data %>%
  group_by(UnitCode.x, Year) %>%
  # annual unit total divided by annual park-wide total, as a percentage
  mutate(percentage = sum(RecreationVisitors.x) / sum(RecreationVisitors.y) * 100) %>%
  ungroup()

plotly::ggplotly(
  # keep one row per unit and year so each annual value is plotted once
  ggplot(data = distinct(joined_data, UnitCode.x, Year, percentage)) +
    geom_point(aes(x = Year, y = percentage, color = UnitCode.x)) +
    geom_path(aes(x = Year, y = percentage, color = UnitCode.x)) +
    theme_bw(base_size = 10)
)

```

### Exercise #5 {style="color: maroon"}

**Which park on average has the most visitation? Which park has the least visitation? Base your response on the data starting in 1990, ending in 2021. Defend your answer with numbers!**

```{r}
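# average annual recreation visitation per park unit, 1990-2021
avg_visitation <- annual_visitation %>%
  filter(UnitCode != "Parkwide", Year >= 1990, Year <= 2021) %>%
  group_by(UnitCode) %>%
  summarize(MeanRecVisitation = mean(RecVisitation)) %>%
  # most-visited unit first, least-visited unit last
  arrange(desc(MeanRecVisitation))

avg_visitation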

```

GRSM has the most visitation on average, while ZION has the least.