---
title: "Modelling Public Water Quality Data"
author: "Matthew Ross"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
  html_document:
    toc: yes
    toc_depth: 3
    toc_float: true
editor_options:
  chunk_output_type: console
---
```{r setup, warning=FALSE, message=FALSE}
source('setup.R')
```
Now that we have a 'tidy' data set from our previous lesson, which includes both discharge and concentration data, let's look at what we have. First, where is the data?

The two datasets were saved as '.RDS' files. These are almost identical to '.RData' files, but unlike '.RData', an '.RDS' file cannot store multiple objects, so we used this file type to save a single data frame in each file. We save '.RDS' files with the function `saveRDS()` and read them back with `readRDS()`, assigning the result to a new object in our environment.
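As a quick sketch of that round trip (the example data and temporary file path here are made up for illustration):

```{r}
df <- data.frame(x = 1:3)
path <- tempfile(fileext = '.RDS') # a throwaway file path for the example
saveRDS(df, path)                  # write a single object to disk
df_copy <- readRDS(path)           # read it back, assigning to a new name
identical(df, df_copy)             # TRUE
```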
## Data Load
```{r data readin}
# read in water quality data saved at the end of assignment 1
wq <- readRDS('data/tidied_full_wq.RDS')

# create a tibble of site info we will need to use later
colorado <- tibble(siteid = c('USGS-09034500', 'USGS-09069000', 'USGS-09085000',
                              'USGS-09095500', 'USGS-09152500'),
                   basin = c('colorado1', 'eagle', 'roaring',
                             'colorado3', 'gunnison')) %>%
  bind_rows(tibble(siteid = c('USGS-09180000', 'USGS-09180500', 'USGS-09380000'),
                   basin = c('dolores', 'colorado4', 'colorado5')))
```
## Site info extraction
We can get geospatial information for sites within the Water Quality Portal with the `whatWQPsites()` function. Then we can reduce and rename columns, much like we did in the previous assignment (for our own clarity):
```{r}
site_info <- whatWQPsites(siteid = unique(wq$siteid)) %>%
  dplyr::select(siteid = MonitoringLocationIdentifier,
                name = MonitoringLocationName,
                area = DrainageAreaMeasure.MeasureValue,
                area_units = DrainageAreaMeasure.MeasureUnitCode,
                elev = VerticalMeasure.MeasureValue,
                elev_units = VerticalMeasure.MeasureUnitCode,
                lat = LatitudeMeasure,
                long = LongitudeMeasure) %>%
  distinct() %>% # distinct() keeps only the first of any duplicate rows
  inner_join(colorado, by = "siteid")
```
### Map
Here we use the `sf` package to turn the site information into a geospatial object called a simple feature, or `sf`. The function `st_as_sf()` converts the longitude (x) and latitude (y) coordinates into a point geometry with the coordinate reference system EPSG:4326 (WGS 84). We can then use the `mapview` package and function to look at where these sites are.
```{r}
# convert site info into an sf object
site_sf <- site_info %>%
  st_as_sf(coords = c('long', 'lat'), crs = 4326)

mapview(site_sf)
```
So these sites are generally in the Colorado River Basin, with increasing watershed size (shown as 'area' in each point's pop-up window).
# Modelling Data
## Trend detection?
Now that we know where the data come from, let's start modelling! The first question we might want to explore is: **Are concentrations of elements changing over time?** Let's first focus on Calcium in the Dolores River. As with all data work, the first thing you should do is look at your data.
```{r}
dolores_ca <- wq %>%
  filter(basin == 'dolores', parameter == 'Calcium')

ggplot(dolores_ca, aes(x = date, y = conc)) +
  geom_point()
```
## Adding a trend line with ggplot
`ggplot` has an easy method for adding a trend line to plots (`stat_smooth`). The code below uses a linear model to fit the line:
```{r}
ggplot(dolores_ca, aes(x = date, y = conc)) +
  geom_point() +
  stat_smooth(method = 'lm')
```
... That line looks pretty flat!
### Linear models for trend detection (the wrong way...)
A very intuitive way to try to detect if there is a long term trend is to use linear models as `ggplot` does. So let's go ahead and write out a model for daily Calcium data using the `lm` function, specifying a model where concentration (`conc`) varies by (`~`) date (`date`).
```{r}
ca_model <- lm(conc ~ date, data = dolores_ca)
summary(ca_model)
```
### The right way!
Using a linear model for trend detection breaks one of the cardinal rules of linear modelling, namely that each observation is **assumed to be independent of every other observation**. In a time series like the one we are looking at here, yesterday's Calcium concentration is deeply related to today's concentration, so linear models should **never** be used for trend detection on time-series data. Instead, we should use the non-parametric Mann-Kendall test and Sen's slope.
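You can see this violation for yourself by checking the autocorrelation of the linear model's residuals with base R's `acf()`; strongly correlated residuals break the independence assumption. A quick sketch (an extra diagnostic, not part of the original workflow):

```{r}
# sort by date so that lag 1 really means 'the previous observation'
dolores_ca <- arrange(dolores_ca, date)

# tall bars at low lags = today's residual depends on yesterday's
acf(resid(lm(conc ~ date, data = dolores_ca)))
```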
#### Mann-Kendall test
The Mann-Kendall test is a non-parametric test for trends; you can use `?mk.test` to read more about the method, but it only requires an ordered time series to run. Let's use it here.
```{r}
# make sure the data are arranged by date before running the test
dolores_ca <- dolores_ca %>%
  arrange(date)

dolores_mk <- mk.test(dolores_ca$conc)

print(dolores_mk)
```
The Mann-Kendall test is essentially a yes/no answer: if the p-value is below some threshold (usually 0.05), you can be reasonably confident that there is a 'real' trend in the data. However, it doesn't tell you the slope of that trend. For that you need `sens.slope`.
```{r}
dolores_slope <- sens.slope(dolores_ca$conc)
dolores_slope
```
Notice that `sens.slope` gives you a slope value and a p-value (the same p-value found in the Mann-Kendall test). For this reason, it is almost always easier to just use `sens.slope` to get both significance and slope.
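If you just want the raw numbers, you can also pull them straight out of the returned `htest` object; this is the structure our helper function below relies on:

```{r}
dolores_slope$estimates # Sen's slope
dolores_slope$p.value   # the same p-value that mk.test reports
```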
#### Cleaner output
The output from these models is not organized very nicely. We can use the `tidy()` function from the `broom` package to clean up this output and convert the information into a data frame, with a column for each value/parameter (similar to the outputs from tests used in the `rstatix` package).
```{r}
tidy(dolores_slope)
```
Some model objects converted with `tidy()` don't include both the p-value and the slope, which is slightly maddening, but we can write our own function that does all of this, including running the model:
```{r}
tidier_sens <- function(data){
  # run the Sen's slope test on a numeric vector of concentrations...
  model <- sens.slope(data)
  # ...then tidy the output and attach the slope estimate as its own column
  tidy(model) %>%
    mutate(slope = model$estimates)
}

tidier_sens(data = dolores_ca$conc)
```
We now have a statistical confirmation of what the plot already showed us. There is no long-term trend in Calcium concentrations in the Dolores River (denoted by the high p-value, much greater than our usual 0.05 alpha/cut-off).
# Models everywhere!
We now know how to model data at a single site for a single parameter, but is there an efficient way to do this for ALL sites and ALL parameters?
HECK YES THERE IS!
We will use the magic of nesting data to apply our trend models to all of our parameters and sites. First, let's alter the data set a little to make our question more precise.
### Converting data to late-summer annual medians
Water chemistry is heavily controlled by seasonality and water flow, so let's try to control for that by summarizing our data to only the low-flow periods of the year. Basically, we will be focusing on: **are there trends in low-flow concentrations of ions in the stream?**
```{r}
low_flow <- wq %>%
  mutate(month = month(date),
         year = year(date)) %>% # create columns of just month and year
  filter(month %in% c(8, 9, 10, 11)) %>% # keep the late summer/fall low-flow months
  group_by(basin, siteid, parameter, year) %>%
  summarize(conc = median(conc, na.rm = TRUE)) # annual median conc for each site/parameter pair

ggplot(low_flow, aes(x = year, y = conc, color = basin)) +
  facet_wrap(~parameter, scales = 'free') +
  geom_point() +
  theme_minimal() +
  scale_y_log10() +
  theme(legend.position = c(0.7, 0.2),
        legend.direction = 'horizontal') +
  ylab('Concentration (mg/l)')
```
## The Magic of nesting
Now we have a few things:
1. A data set that is winnowed down to just low-flow periods of the year
2. A function (`tidier_sens`) we can use to test for long-term trends in concentration with Sen's slope and convert the output to a data frame
3. A desire to apply this function to all of our sites and water quality parameters
To accomplish item 3, we need the magic of `nest()`. Nesting allows us to group data by site and parameter (as with a `group_by` and a `summarize`) and apply models to each group separately. Effectively, nesting bundles (... or nests!) the data into tidy little packets that we can apply the model to. Let's try!
### Nesting data
```{r}
low_nest <- low_flow %>%
  # rename 'parameter' to 'characteristic'; the model output we create later
  # already uses 'parameter' as a column name
  group_by(characteristic = parameter, basin) %>%
  nest()

low_nest
```
The above code produces a tibble with three columns: `characteristic`, `basin`, and `data`. The `data` column holds our nested (bundled) data for each basin-characteristic combination. We can tell because the '*list*' data type is printed under the `data` column, and each row has a 'tibble' nested within it.
For example, to retrieve one of those nested data frames:
```{r}
low_nest$data[[1]] # subset my low_nest data to just the data column, then select just the first nested tibble
```
### Modelling over nested data
Now we want to apply our model to each nested data frame. To do this we use the `map()` function. `map()` takes an input (the `data` column) and a function to apply (in this case `tidier_sens`, our wrapper around `sens.slope`). We use `.x$conc` to indicate that we want to apply the model to the concentration column within each bundled (nested) data frame.
```{r}
wq_models <- low_nest %>%
  # store the tidied model results for each row's nested data set
  mutate(tidy_mods = map(data, ~ tidier_sens(.x$conc)))

wq_models
```
Now we have a nested data set AND nested models (that are hard to see). We can look at a single model by indexing it:
```{r}
# This provides the 15th model summary
wq_models$tidy_mods[[15]]
```
But that is a tedious way to look at our model summaries!
Instead, we can lean on the `tidier_sens()` function we made upstream together with `unnest()`. Because `map()` already applied `tidier_sens()` to every raw `sens.slope` model, each row of `tidy_mods` holds a small data frame with the p-value and slope. `unnest()` unravels that column so we end up with a final, flat data frame of model outputs.
```{r}
wq_mod_summaries <- wq_models %>%
  unnest(tidy_mods) %>% # spread the nested model output into regular columns
  select(basin, characteristic, p.value, slope) %>%
  # flag whether there is a significant trend, using a p-value cut-off of 0.01
  mutate(trend = ifelse(p.value < 0.01, 'yes', 'no'))

wq_mod_summaries
```
### Visualizing model output
```{r}
ggplot(wq_mod_summaries, aes(x = characteristic, y = slope, color = trend)) +
  geom_point() +
  facet_wrap(~basin, scales = 'free') +
  theme_minimal() +
  scale_color_manual(values = c('black', 'green3')) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = c(0.8, 0.1))
```
# Assignment
The above workflow focuses on trend detection with Sen's slope, but here we will focus on an appropriate use of linear models. To do that, we want to join our discharge data to our water quality data and look at the relationship between discharge (Q) and water quality (WQ).
## Join Discharge and Water Quality Data
Use `inner_join()` to join our daily discharge data (`Q.RDS`) to our raw water quality data (`tidied_full_wq.RDS`). You want to join by both date and siteid. Remember: the discharge data has site IDs from which we dropped the `USGS-` prefix, so you will need to add that back in using `paste0()`.
```{r}
# read in the daily discharge data saved in assignment 1
Q <- readRDS('data/Q.RDS')

# add the 'USGS-' prefix back and standardize the column names
Q2 <- Q %>%
  mutate(siteid = paste0("USGS-", site_no),
         date = Date,
         flow = X_00060_00003) %>%
  select(-c(site_no, Date, X_00060_00003, X_00060_00003_cd))

joined_Q_wq <- inner_join(Q2, wq, by = c("siteid", "date"))
```
### Pick any site and ion combination and plot discharge versus ion concentration
```{r}
roaring_sodium <- joined_Q_wq %>%
  filter(basin == "roaring", parameter == "Sodium")

# a scatterplot suits this relationship better than a connected line
ggplot(roaring_sodium, aes(x = flow, y = conc)) +
  geom_point()
```
#### What do you see in this relationship?
At low flows, sodium concentrations are high; at higher flows the sodium is diluted and concentrations drop.
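As an optional extra (not required by the assignment), concentration-discharge relationships are often easier to read on log-log axes, which spread out the many low-flow observations:

```{r}
ggplot(roaring_sodium, aes(x = flow, y = conc)) +
  geom_point() +
  scale_x_log10() + # log axes make the dilution pattern easier to see
  scale_y_log10()
```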
## Models everywhere
Group your data by basin and water quality parameter and nest the data.
```{r}
Q_wq_nested <- joined_Q_wq %>%
  group_by(basin, parameter) %>%
  nest()
```
## Apply a linear model to the data
You will need to use a `map()` command like this: `map(data, ~ lm(conc ~ flow, data = .x))` (substituting whatever name you gave your discharge column).
```{r}
Q_wq_nested_lm <- Q_wq_nested %>%
  mutate(mod = map(data, ~ lm(conc ~ flow, data = .x)))
```
## Summarize your data using `tidy`
You should have a new column called `mods` or something similar, and you need to `tidy` those mods and store this new, tidier data in another column.
```{r}
tidy_wq_lm <- Q_wq_nested_lm %>%
  mutate(tidy_mods = map(mod, tidy)) %>% # tidy each lm into a small data frame
  unnest(cols = tidy_mods) %>%
  # tidy() returns one row per model term; keep only the flow (slope) term
  filter(term == 'flow') %>%
  select(basin, parameter, p.value, estimate) %>%
  mutate(trend = ifelse(p.value < 0.01, 'yes', 'no')) # flag significant relationships
```
## Make a visual of your models' summaries that shows both which sites have significant relationships between discharge and concentration and the slope of that relationship.
```{r}
# after filtering to the flow term, the 'estimate' column is the slope of the
# concentration-discharge relationship
ggplot(tidy_wq_lm, aes(x = parameter, y = estimate, color = trend)) +
  geom_point() +
  facet_wrap(~basin, scales = 'free') +
  theme_minimal() +
  scale_color_manual(values = c('black', 'red')) + # red = significant relationship
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = c(0.8, 0.1))
```
## Bonus
Look up the `furrr` package. What does `furrr::future_map()` do that is different from `purrr::map()`?

When would you want to use a `furrr` function?
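As a sketch of the idea (assuming `furrr` is installed and you have spare cores), the only changes from our `purrr` workflow are a `plan()` call and the `future_` prefix:

```{r, eval=FALSE}
library(furrr)
plan(multisession, workers = 4) # spread the mapped function across 4 parallel R sessions

wq_models_parallel <- low_nest %>%
  mutate(tidy_mods = future_map(data, ~ tidier_sens(.x$conc)))
```

`future_map()` behaves like `map()` but runs the function on parallel workers, which pays off when each model is slow or there are many nested groups; for the small models here it would make little difference.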