diff --git a/01_Tidying_WQ_data_Shelhon.Rmd b/01_Tidying_WQ_data_Shelhon.Rmd new file mode 100644 index 0000000..eb3eb40 --- /dev/null +++ b/01_Tidying_WQ_data_Shelhon.Rmd @@ -0,0 +1,377 @@ +--- +title: "Tidying Public Water Quality Data" +author: "Matthew Ross" +date: "`r format(Sys.time(), '%B %d, %Y')`" +output: + html_document: +editor_options: + markdown: + wrap: 72 +--- + +# Why public data sets? + +Working with large, open-access data sets can serve many purposes. It +can be an excellent way to explore new ideas, before investing in +field-work or experiments. It can be a great way to take local or +experimental results and expand them to different ecosystems, places, or +landscapes. Or, it can be an excellent way to build, validate, and test +ecological models on regional or national scales. + +So why doesn't everyone use public data? Well, it's often collected by a +variety of organizations, with different methods, units, and +inconsistent metadata. Together these issues make large public data sets +quite messy. This messiness can take many different forms, but at the +basic level it means that data is hard to analyze. This is not +necessarily because the data itself is bad, but because the way it is +organized is unclear and/or inconsistent. + +In this assignment, we will use the skills we've learned to convert a +messy data set into something ready for analysis. As always, we will +depend heavily on the {tidyverse}, an excellent series of packages that +make data manipulation beautiful and easy. We will also be working with +the US's [Water Quality Portal data +base](https://www.waterqualitydata.us/), which we will access with the +{dataRetrieval} package. + +## Loading key packages + +Let's first load in all of our necessary packages, which you can do with +the 'setup.R' script we've created for you. Some of these packages might +be new to you, such as: + +- `mapview` : Similar to `tmap`, another package that allows us to + interactively map our data + +- `broom` : simplified model outputs + +- `trend` : Package to explore trends in data + +```{r setup, warnings = 'hide', message = FALSE} +source('setup.R') +``` + +# Downloading data + +For this lesson, we'll explore water quality data in the Colorado River +Basin as it moves from Colorado to Arizona. All data will be generated +through the code you see below, with the only external information +coming from knowing the site IDs for the monitoring locations along the +Colorado River and the water quality characteristic names we are +interested in. + +The water quality portal can be accessed with the function +`readWQPdata()`, which takes a variety of arguments (like start date, +end date, constituents, etc.). We'll generate these arguments for +downloading the data below. + +## Download prep + +First we'll make a tibble (aka, a tidyverse-style table/data frame) of +our site IDs of interest. (Generally, as the site ID's number increases, +the site's location moves further downstream of the river's headwaters, +which are near Grand Lake, CO.) + +```{r} +colorado <- tibble(siteid = c("USGS-09034500", "USGS-09069000", + "USGS-09085000", "USGS-09095500", "USGS-09152500"), + basin = c("colorado1", "eagle", + "roaring", "colorado3", "gunnison")) +``` + +Now that we have a tibble that contains a list of our sites of interest, +we next will need to set up a series of rules for downloading data from +the Water Quality Portal. + +We'll focus on cation and anion data from 1980-present. 
Each cation has +a name that we might typically use (like calcium or sulfate), but the +name may be different in the Water Quality Portal, so we have to check +this website +() +to get our names correct. + +```{r} +# ca = "Calcium" +# mg = "Magnesium" +# na = "Sodium" +# k = "Potassium" +# so4 = c("Sulfate", "Sulfate as SO4", "Sulfur Sulfate", "Total Sulfate") +# cl = "Chloride" +# hco3 = c("Alkalinity, bicarbonate", "Bicarbonate") + +# Compile all these names into a single vector: +parameters <- c("Calcium", "Magnesium", "Sodium", "Potassium", "Sulfate", "Sulfate as SO4", "Sulfur Sulfate", "Total Sulfate", "Chloride", "Alkalinity, bicarbonate", "Bicarbonate") +``` + +Using our site names and characteristic names generated above, we can +now run the `readWQPdata()` function to import water quality data for +each of our sites and parameters: + +```{r} +conc_wide <- readWQPdata(siteid = colorado$siteid, # pulling our siteids column in from the colorado object, becomes a vector + sampleMedia = "Water", # WQP also has bilogical data and sediment data + startDateLo = "1980-10-01", # must be formatted as YYYY-MM-DD + startDateHi = "2023-11-01", # must be formatted as YYYY-MM-DD + characteristicName = parameters) # our vector of `characteristcName`s +``` + +# Data tidying + +Now that we have downloaded the data, we need to tidy it up. The Water +Quality Portal data comes with an incredible amount of metadata in the +form of extra columns. But we don't need all this extra data. + +## Look at the data you downloaded + +```{r} +View(conc_wide) +``` + +## Initial cleaning up + +Wow, that looks messy! Lots of extraneous columns, lots of NAs, so much +information we can hardly parse it. Let's pare it down to the +essentials. + +```{r} +# This code mostly just grabs and renames the most important data columns +conc_small <- conc_wide %>% + select(date = ActivityStartDate, + parameter = CharacteristicName, + units = ResultMeasure.MeasureUnitCode, + siteid = MonitoringLocationIdentifier, + org = OrganizationFormalName, + org_id = OrganizationIdentifier, + time = ActivityStartTime.Time, + value = ResultMeasureValue, + sample_method = SampleCollectionMethod.MethodName, + analytical_method = ResultAnalyticalMethod.MethodName, + particle_size = ResultParticleSizeBasisText, + date_time = ActivityStartDateTime, + media = ActivityMediaName, + sample_depth = ActivityDepthHeightMeasure.MeasureValue, + sample_depth_unit = ActivityDepthHeightMeasure.MeasureUnitCode, + fraction = ResultSampleFractionText, + status = ResultStatusIdentifier) %>% + # Remove trailing white space in labels + mutate(units = trimws(units)) +``` + +Let's also add in the aliases we established in our`colorado` object, +and then add a new column, `ion`, that represents what each of our +`parameter`s are associated with. 
Next, let's make sure this information +lives at the beginning of the table: + +```{r} +conc_meta <- conc_small %>% + left_join(., colorado, by = "siteid") %>% + dplyr::mutate(ion = dplyr::case_when(parameter == "Calcium" ~ "Ca", + parameter == "Magnesium" ~ "Mg", + parameter == "Sodium" ~ "Na", + parameter == "Potassium" ~ "K", + parameter %in% c("Sulfate", "Sulfate as SO4", "Sulfur Sulfate", "Total Sulfate") ~ "SO4", + parameter == "Chloride" ~ "Cl", + parameter %in% c("Alkalinity, bicarbonate", "Bicarbonate") ~ "HCO3")) %>% + select(siteid, basin, ion, parameter, date, everything()) +``` + +Now let's look at this tidier version: + +```{r} +View(conc_meta) +``` + +## Final tidy data set + +That is getting better, but we still have lots of extraneous +information. For our purposes let's assume that the sample and analytic +methods used by the USGS are reasonable and exchangeable (i.e., one +method is equivalent to the other). If we make that assumption then the +only remaining data we need to clean is to make sure that all the data +has the same units. + +### Unit check + +```{r unit check} +table(conc_meta$units) +``` + +Wow! Almost all the data is in mg/L (or, depending on when you pulled +it, all of it). That makes our job really easy. + +We just need to remove these (potential) non-mg/L observations with a +`dplyr::filter` call and then select an even smaller subset of useful +columns, while adding a date object column using the `lubridate::ymd` +call. + +```{r tidy} +conc_tidy <- conc_meta %>% + filter(units == 'mg/l') %>% + mutate(date = ymd(date)) %>% + select(date, + parameter, + ion, + siteid, + basin, + conc = value) +``` + +### Daily data + +We now have a manageable data frame. But how do we want to organize the +data? Since we are looking at a really long time-series of data, let's +look at data as a daily average. The `dplyr::group_by` +`dplyr::summarize` functions make this really easy: + +```{r} +# The amazing group_by function groups all the data so that the summary +# only applies to each subgroup (siteid, date, and parameter combination). +# So in the end you get a daily average concentration for each siteid and parameter type. +conc_daily <- conc_tidy %>% + group_by(date, parameter, siteid, basin) %>% + summarize(conc = mean(conc, na.rm = T)) +``` + +Taking daily averages looks like it did eliminate +`r nrow(conc_tidy) - nrow(conc_daily)` observations, meaning these +site-date combinations had multiple observations on the same day. + +# Assignment! + +Let's imagine you wanted to add more data to your water quality +analyses, but you also know that you need to do this analysis over and +over again. Let's walk through how we would: 1) Add new data to our +current `conc_tidy` data set and 2) how to write a function to download, +clean, and update our data with far less code. + +## Question 1 + +Write a function that can repeat the above steps for any table of site +IDs with a single function call. This function should take in a single +tibble that is identical in structure to the `colorado` one above (i.e., +it has columns named `siteid` and `basin`), and produce an identical +data frame to `conc_daily` (i.e., a data frame of daily average ion +concentrations). 
+ +```{r} +tidier_function <- function(x){ +parameters <- c("Calcium", "Magnesium", "Sodium", "Potassium", "Sulfate", "Sulfate as SO4", "Sulfur Sulfate", "Total Sulfate", "Chloride", "Alkalinity, bicarbonate", "Bicarbonate") + conc_wide <- readWQPdata(siteid = x$siteid, # pulling our siteids column in from the colorado object, becomes a vector + sampleMedia = "Water", # WQP also has bilogical data and sediment data + startDateLo = "1980-10-01", # must be formatted as YYYY-MM-DD + startDateHi = "2023-11-01", # must be formatted as YYYY-MM-DD + characteristicName = parameters) # our vector of `characteristcName`s + conc_small <- conc_wide %>% + select(date = ActivityStartDate, + parameter = CharacteristicName, + units = ResultMeasure.MeasureUnitCode, + siteid = MonitoringLocationIdentifier, + org = OrganizationFormalName, + org_id = OrganizationIdentifier, + time = ActivityStartTime.Time, + value = ResultMeasureValue, + sample_method = SampleCollectionMethod.MethodName, + analytical_method = ResultAnalyticalMethod.MethodName, + particle_size = ResultParticleSizeBasisText, + date_time = ActivityStartDateTime, + media = ActivityMediaName, + sample_depth = ActivityDepthHeightMeasure.MeasureValue, + sample_depth_unit = ActivityDepthHeightMeasure.MeasureUnitCode, + fraction = ResultSampleFractionText, + status = ResultStatusIdentifier) %>% + # Remove trailing white space in labels + mutate(units = trimws(units)) + conc_meta <- conc_small %>% + left_join(., x, by = "siteid") %>% + dplyr::mutate(ion = dplyr::case_when(parameter == "Calcium" ~ "Ca", + parameter == "Magnesium" ~ "Mg", + parameter == "Sodium" ~ "Na", + parameter == "Potassium" ~ "K", + parameter %in% c("Sulfate", "Sulfate as SO4", "Sulfur Sulfate", "Total Sulfate") ~ "SO4", + parameter == "Chloride" ~ "Cl", + parameter %in% c("Alkalinity, bicarbonate", "Bicarbonate") ~ "HCO3")) %>% + select(siteid, basin, ion, parameter, date, everything()) + conc_tidy <- conc_meta %>% + filter(units == 'mg/l') %>% + mutate(date = ymd(date)) %>% + select(date, + parameter, + ion, + siteid, + basin, + conc = value) + conc_daily <- conc_tidy %>% + group_by(date, parameter, siteid, basin) %>% + summarize(conc = mean(conc, na.rm = T))} +``` + +## Question 2 + +Using the function you developed in Question 1, download and clean water +quality data for the site IDs listed below: + +```{r} +additional_data <- tibble(siteid = c('USGS-09180000', 'USGS-09180500', 'USGS-09380000'), + basin = c('dolores', 'colorado4', 'colorado5')) +``` + +```{r} +test <- tidier_function(additional_data) +``` + +This output of running the function should look identical in format to +our original `conc_daily` data set, but for these sites instead. + +## Question 3 + +Combine the data pulled in Question 2 with the original data from +`conc_daily`, so that this data is in a single data frame. Save this +combined data as `tidied_full_wq.RDS` in a folder called data. + +```{r} + +tidied_full_wq.RDS <- bind_rows(conc_daily, test) + +saveRDS(tidied_full_wq.RDS, file = 'data/tidied_full_wq.RDS') +``` + +## Question 4 + +We now have a data set of stream water quality data for several sites +throughout Colorado. One potential control on stream chemistry is stream +discharge. A function in the {dataRetrieval} package that allows you to +easily download discharge data is `readNWISdv()`. Use this function to +download daily discharge data for all eight of the sites we are +interested in. Save the data as an RDS object called called `Q` in the +data folder: `data/Q.RDS`. 
The site numbers are the same as above, but you need to remove `USGS-` from each site. Discharge is `parameterCd = 00060` and you should use `renameNWISColumns()` to automatically make the column names a little less annoying.
+
+```{r}
+# Reminder! you can use ?readNWISdv to read about how the function works.
+sites <- colorado %>%
+  # Bind the two site tables to get all 8 sites
+  bind_rows(additional_data) %>%
+  # Grab just the siteid column as a vector
+  pull(siteid) %>%
+  # Remove the USGS- prefix
+  gsub('USGS-', '', .)
+
+# Download daily mean discharge (00060) for all eight sites and tidy the column names
+q_data <- readNWISdv(siteNumbers = sites,
+                     parameterCd = "00060",
+                     statCd = "00003") %>%
+  renameNWISColumns()
+
+saveRDS(q_data, file = 'data/Q.RDS')
+```
diff --git a/02_Modelling_WQ_data_Shelhon.Rmd b/02_Modelling_WQ_data_Shelhon.Rmd
new file mode 100644
index 0000000..47edd78
--- /dev/null
+++ b/02_Modelling_WQ_data_Shelhon.Rmd
@@ -0,0 +1,370 @@
+---
+title: "Modelling Public Water Quality Data"
+author: "Matthew Ross"
+date: "`r format(Sys.time(), '%B %d, %Y')`"
+output:
+  html_document:
+    toc: yes
+    toc_depth: 3
+    toc_float: true
+editor_options:
+  chunk_output_type: console
+---
+
+```{r setup, warnings = 'hide', message = FALSE}
+source('setup.R')
+```
+
+Now that we have a 'tidy' data set from our previous lesson, which includes both discharge data and concentration data, let's look at the data we have. First, where is the data?
+
+The two datasets were saved as '.RDS' files. These are almost identical to '.RData' files, but unlike '.RData', an '.RDS' file cannot store multiple objects, so we used this file type to save a single data frame in each file. We save '.RDS' files with `saveRDS()` and read them back with `readRDS()`, assigning the result to a new object in our environment.
+
+## Data Load
+
+```{r data readin}
+# read in water quality data saved at the end of assignment 1
+wq <- readRDS('data/tidied_full_wq.RDS')
+
+# create a tibble of site info we will need to use later
+colorado <- tibble(siteid = c('USGS-09034500', 'USGS-09069000',
+                              'USGS-09085000', 'USGS-09095500', 'USGS-09152500'),
+                   basin = c('colorado1', 'eagle',
+                             'roaring', 'colorado3', 'gunnison')) %>%
+  bind_rows(tibble(siteid = c('USGS-09180000', 'USGS-09180500', 'USGS-09380000'),
+                   basin = c('dolores', 'colorado4', 'colorado5')))
+```
+
+## Site info extraction
+
+We can get geospatial information for sites within the Water Quality Portal with the `whatWQPsites()` function. Then, we can reduce and rename columns much like we did in the previous assignment (for our own clarity):
+
+```{r}
+site_info <- whatWQPsites(siteid = unique(wq$siteid)) %>%
+  dplyr::select(siteid = MonitoringLocationIdentifier,
+                name = MonitoringLocationName,
+                area = DrainageAreaMeasure.MeasureValue,
+                area.units = DrainageAreaMeasure.MeasureUnitCode,
+                elev = VerticalMeasure.MeasureValue,
+                elev_units = VerticalMeasure.MeasureUnitCode,
+                lat = LatitudeMeasure,
+                long = LongitudeMeasure) %>%
+  distinct() %>% # Distinct just keeps the first of any duplicates.
+  inner_join(colorado, by = "siteid")
+```
+
+### Map
+
+Here we use the `sf` package to turn the site information into a geospatial object called a simple feature, or `sf`. The function `st_as_sf` converts the longitude (x) and latitude (y) coordinates into a point feature with the coordinate reference system EPSG 4326 (WGS 84). We can then use the `mapview` package and function to look at where these sites are.
+
+```{r}
+# convert site info into an sf object
+site_sf <- site_info %>%
+  st_as_sf(., coords = c('long', 'lat'), crs = 4326)
+
+mapview(site_sf)
+```
+
+So these sites are generally in the Colorado River Basin, with increasing watershed size (denoted by 'area' in the point pop-up window).
+
+# Modelling Data
+
+## Trend detection?
+
+Now that we know where the data is coming from, let's start modelling! The first question we might want to explore is: **Are concentrations of elements changing over time?** Let's first focus on Calcium in the Dolores River. As with all data work, the first thing you should do is look at your data.
+
+```{r}
+dolores_ca <- wq %>%
+  filter(basin == 'dolores', parameter == 'Calcium')
+
+ggplot(dolores_ca, aes(x = date, y = conc)) +
+  geom_point()
+```
+
+## Adding a trend line with ggplot
+
+`ggplot` has an easy method for adding a trend line to plots (`stat_smooth`). The code below uses a linear model to fit the line:
+
+```{r}
+ggplot(dolores_ca, aes(x = date, y = conc)) +
+  geom_point() +
+  stat_smooth(method = 'lm')
+```
+
+... That line looks pretty flat!
+
+### Linear Models for Trend Detection (the wrong way...)
+
+A very intuitive way to try to detect whether there is a long-term trend is to use linear models, as `ggplot` does. So let's go ahead and write out a model for daily Calcium data using the `lm` function, specifying a model where concentration (`conc`) varies by (`~`) date (`date`).
+
+```{r}
+ca_model <- lm(conc ~ date, data = dolores_ca)
+summary(ca_model)
+```
+
+### The right way!
+
+Using a linear model for trend detection breaks one of the cardinal rules of linear modelling, namely that each observation is **assumed to be independent of any other observation**. In a time series like the one we are looking at here, yesterday's Calcium concentration is deeply related to today's concentration. So linear models should **never** be used for trend detection on time series data. Instead, we should use the Mann-Kendall test and Sen's slope.
+
+#### Mann-Kendall test
+
+The Mann-Kendall test is a non-parametric test for trends; you can use `?mk.test` to read more about the method, but it only requires an ordered time series to run. Let's use it here.
+
+```{r}
+?mk.test
+```
+
+```{r}
+dolores_ca <- dolores_ca %>%
+  # Make sure data is arranged by date using `arrange()`
+  arrange(date)
+
+dolores_mk <- mk.test(dolores_ca$conc)
+
+print(dolores_mk)
+# p-value = 0.1789, not significant
+```
+
+The Mann-Kendall test is really just a yes/no: if the p-value is below some threshold (usually 0.05), then you can be mostly confident that there is a 'real' trend in the data. However, it doesn't tell you the slope of that trend. For that you need to use `sens.slope`.
+
+```{r}
+dolores_slope <- sens.slope(dolores_ca$conc)
+
+dolores_slope
+# Sen's slope = 0.0265165, p-value = 0.1789 (same as the Mann-Kendall test), not significant
+```
+
+Notice that `sens.slope` gives you a slope value and a p-value (which is the same p-value found in the Mann-Kendall test). For this reason, it is almost always easier to just use `sens.slope` to get both significance and slope.
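+
+If you only need the numbers themselves, both pieces can be pulled straight off the `sens.slope()` result (a quick sketch using the `dolores_slope` object created above; `sens.slope()` returns an `htest`-style list, which is also how the helper function in the next section grabs the slope):
+
+```{r}
+# Pull the slope estimate and the p-value directly from the htest-style object
+dolores_slope$estimates
+dolores_slope$p.value
+```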
+
+#### Cleaner output
+
+The output from these models is not organized very nicely. We can use the `tidy()` function from the `broom` package to clean up this output and convert the information into a data frame, with a column for each value/parameter (similar to the outputs from tests used in the `rstatix` package).
+
+```{r}
+tidy(dolores_slope)
+```
+
+Some model objects that get converted with the `tidy()` function don't include both the p-value and the slope, which is slightly maddening, but we can make our own function to do all of this, including running the model:
+
+```{r}
+tidier_sens <- function(data){
+
+  model <- sens.slope(data)
+
+  tidy(model) %>%
+    mutate(slope = model$estimates)
+
+}
+
+tidier_sens(data = dolores_ca$conc)
+```
+
+We now have a statistical confirmation of what the plot already showed us. There is no long-term trend in Calcium concentrations in the Dolores River (denoted by the high p-value, much greater than our usual 0.05 alpha/cut-off).
+
+# Models everywhere!
+
+We now know how to model data at a single site for a single parameter, but is there an efficient way to do this for ALL sites and ALL parameters?
+
+HECK YES THERE IS!
+
+We will use the magic of `nesting` data to apply our trend models to all of our parameters and sites. First, let's alter the data set a little to sharpen our question.
+
+### Converting data to late summer annual means
+
+Water chemistry is heavily controlled by seasonality and water flow, so let's try to control for that and summarize our data to only include the low-flow periods of the year. Basically, we will be focusing on: **are there trends in low-flow concentrations of ions in the stream?**
+
+```{r}
+low_flow <- wq %>%
+  mutate(month = month(date),
+         year = year(date)) %>% # create columns of just month and year
+  filter(month %in% c(8, 9, 10, 11)) %>% # keep only the late summer and fall months
+  group_by(basin, siteid, parameter, year) %>%
+  summarize(conc = median(conc, na.rm = T)) %>% # calculate annual conc for each site/parameter pair
+  arrange()
+
+ggplot(low_flow, aes(x = year, y = conc, color = basin)) +
+  facet_wrap(~parameter, scales = 'free') +
+  geom_point() +
+  theme_minimal() +
+  scale_y_log10() +
+  theme(legend.position = c(0.7, 0.2),
+        legend.direction = 'horizontal') +
+  ylab('Concentration (mg/l)')
+```
+
+## The Magic of nesting
+
+Now we have a few things:
+
+1. A data set that is winnowed down to just the low-flow periods of the year
+
+2. A function (`tidier_sens`) we can use to check for long-term trends in concentration with Sen's slope, and then convert the `sens.slope` output to a data frame
+
+3. A desire to apply this function to all of our sites and water quality parameters
+
+To accomplish step three, we need to use the magic of `nest()`. Nesting allows us to group data by site and parameter (like with a `group_by` and a `summarize`) and apply models to each site and parameter separately. Effectively, nesting bundles (... or nests!) the data into tidy little packets that we can apply the model to. Let's try!
+
+### Nesting data
+
+```{r}
+low_nest <- low_flow %>%
+  # rename parameter as characteristic... the model output already has "parameter" as a column name
+  group_by(characteristic = parameter, basin) %>%
+  nest()
+
+low_nest
+```
+
+The above code produces a tibble with three columns: `characteristic`, `basin`, and `data`. The `data` column is our nested (or bundled) data for each basin and characteristic combination.
+We know this because the '*list*' data type is printed under the `data` column, and each row has a tibble nested within it.
+
+For example, to retrieve one of those nested data frames:
+
+```{r}
+low_nest$data[[1]] # subset low_nest to just the data column, then select just the first nested tibble
+```
+
+### Modelling over nested data
+
+Now we want to apply our model to each nested data frame. To do this we need to use the `map()` function. `map()` takes in an x (here, the `data` column) and then a function (in this case our `tidier_sens()` wrapper around `sens.slope()`). We use `.x$conc` to indicate that we want to apply the model to the concentration column within each bundled (nested) data frame.
+
+```{r}
+wq_models <- low_nest %>%
+  mutate(tidy_mods = map(data, ~ tidier_sens(.x$conc))) # create a new column to store the model results for each row/dataset
+
+wq_models
+```
+
+Now we have a nested data set AND nested models (that are hard to see). We can look at a single model by indexing it:
+
+```{r}
+# This provides the 15th model summary
+wq_models$tidy_mods[[15]]
+```
+
+But that is a tedious way to look at our model summaries!
+
+Instead, we can use the power of the `tidier_sens()` function we made upstream, and `unnest()`. Because we applied `tidier_sens()` with `map()` above, each nested model is already a clean table of the p-value and slope. We then use `unnest()` to unravel that data so we have a final data frame that contains the model outputs.
+
+```{r}
+wq_mod_summaries <- wq_models %>%
+  unnest(tidy_mods) %>% # separates the nested column into individual columns for each value
+  select(basin, characteristic, p.value, slope) %>%
+  mutate(trend = ifelse(p.value < 0.01, 'yes', 'no')) # flag whether there was a significant trend based on a p-value cut-off of 0.01
+
+wq_mod_summaries
+```
+
+### Visualizing model output
+
+```{r}
+ggplot(wq_mod_summaries, aes(x = characteristic, y = slope, color = trend)) +
+  geom_point() +
+  facet_wrap(~basin, scales = 'free') +
+  theme_minimal() +
+  scale_color_manual(values = c('black', 'green3')) +
+  theme(axis.text.x = element_text(angle = 45, hjust = 1),
+        legend.position = c(0.8, 0.1))
+```
+
+# Assignment
+
+The above workflow focuses on trend detection with Sen's slope, but here we will focus on an appropriate use of linear models. As such, we want to join our discharge data to our water quality data so we can look at the relationship between Q and WQ.
+
+## Join Discharge and Water Quality Data
+
+Use `inner_join` to join our daily discharge data (`Q.RDS`) to our raw water quality data (`tidied_full_wq.RDS`). You want to join by both date and siteid. Remember, the discharge data has site IDs from which we dropped the `USGS-` prefix, so you will need to add that back in using `paste0`.
+
+```{r}
+Q <- readRDS("data/Q.RDS")
+```
+
+```{r}
+Q_fix <- rename(Q, siteid = site_no)
+```
+
+```{r}
+Q_fix2 <- Q_fix %>%
+  mutate(siteid = paste0("USGS-", siteid))
+```
+
+```{r}
+Q_fix2 <- Q_fix2 %>%
+  rename(date = Date)
+```
+
+```{r}
+# The discharge data now has a `date` column and siteids with the USGS- prefix added back,
+# so it can be joined to the water quality data (`wq`, read in at the top of this document)
+Q_tidied <- inner_join(Q_fix2, wq, by = c("date", "siteid"))
+```
+
+### Pick any site and ion combination and plot discharge versus ion concentration
+
+```{r}
+roaring_su <- Q_tidied %>%
+  filter(basin == 'roaring', parameter == 'Sulfate')
+
+ggplot(roaring_su, aes(x = Flow, y = conc)) +
+  geom_point()
+```
+
+#### What do you see in this relationship?
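+
+It can also help to look at the same relationship on log-log axes, since discharge and concentration often span orders of magnitude (an optional sketch using the `roaring_su` data frame created above):
+
+```{r}
+# Same discharge vs. concentration plot, with both axes on a log10 scale
+ggplot(roaring_su, aes(x = Flow, y = conc)) +
+  geom_point() +
+  scale_x_log10() +
+  scale_y_log10()
+```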
+
+Concentration decreases as discharge increases: a negative (inverse) relationship, consistent with dilution at higher flows.
+
+## Models everywhere
+
+Group your data by basin and water quality parameter and nest the data.
+
+```{r}
+# Nest the full joined data set so each basin/parameter combination gets its own bundled data frame
+basin_wq <- Q_tidied %>%
+  group_by(parameter, basin) %>%
+  nest()
+
+basin_wq
+```
+
+## Apply a linear model to the data
+
+You will need to use a `map` command like this: `map(data, ~lm(conc ~ q, data = .x))`
+
+```{r}
+# After renameNWISColumns() the discharge column is called `Flow`, so we model conc ~ Flow
+basin_wq2 <- basin_wq %>%
+  mutate(model = map(data, ~ lm(conc ~ Flow, data = .x)))
+
+basin_wq2
+```
+
+## Summarize your data using `tidy`
+
+You should have a new column called `mods` or something similar, and you need to `tidy` those mods and store this new, tidier data in another column.
+
+```{r}
+# One possible approach: tidy each lm with broom::tidy(), stored in a new column
+basin_mods <- basin_wq2 %>%
+  mutate(tidy_mods = map(model, tidy))
+
+basin_mods
+```
+
+## Make a visual of your models' summaries that shows both which sites have significant relationships between discharge and concentration and the slope of that relationship.
+
+```{r}
+# One possible visual: the slope of the Flow term for each basin, colored by significance
+basin_mods %>%
+  unnest(tidy_mods) %>%
+  filter(term == 'Flow') %>%
+  ggplot(aes(x = basin, y = estimate, color = p.value < 0.05)) +
+  geom_point() +
+  facet_wrap(~parameter, scales = 'free') +
+  theme(axis.text.x = element_text(angle = 45, hjust = 1))
+```
+
+## Bonus
+
+Look up the `furrr` package. What does `furrr::map` do that is different from `purrr::map`?
+
+When would you want to use this `furrr` function?
+
+`furrr::future_map()` works like `purrr::map()`, but can run the iterations in parallel across multiple workers via the future package. You would reach for it when the function being mapped is slow (for example, fitting many models or downloading many files), so that spreading the work across cores saves real time.
diff --git a/data/tidied_full_wq.RDS b/data/tidied_full_wq.RDS
index 75a045d..d6a48c1 100644
Binary files a/data/tidied_full_wq.RDS and b/data/tidied_full_wq.RDS differ