-
Notifications
You must be signed in to change notification settings - Fork 0
/
NEON-AIS-data-GLEON2022.Rmd
544 lines (417 loc) · 21.6 KB
/
NEON-AIS-data-GLEON2022.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
---
title: "Download and work with NEON Aquatic Instrument Data"
syncID: 317ecab8e00b4a959a76dba181bb33b8
description: Tutorial for downloading NEON AIS data using the neonUtilities package
and then exploring and understanding the downloaded data
dateCreated: '2022-10-21'
authors: Bobby Hensley
contributors: null
estimatedTime: 80 minutes
packagesLibraries: neonUtilities, ggplot2
topics: data-management, rep-sci
languageTool: R, API
code1: R/download-ais-data/download-NEON-AIS-data.R
tutorialSeries: null
urlTitle: explore-neon-ais-data
editor_options:
chunk_output_type: inline
---
This tutorial covers downloading NEON Aquatic Instrument System (AIS) data
using the neonUtilities R package, as well as basic instruction in beginning to
explore and work with the downloaded data. This includes navigating data
documentation, separating data using the horizontal and vertical location
location indices, variable,and interpreting quality flags.
<div id="ds-objectives" markdown="1">
## Objectives
After completing this activity, you will be able to:
* Download NEON AIS data using the `neonUtilities` package.
* Understand downloaded data sets and load them into R for analyses.
* Use the sensorPositions file to understand where data is being collected.
* Understand and interpret quality flags.
## Things You'll Need To Complete This Tutorial
To complete this tutorial you will need R (version >4.1) and,
preferably, RStudio loaded on your computer.
### Install R Packages
* **neonUtilities**: Basic functions for accessing NEON data
* **ggplot2**: Plotting functions
These packages are on CRAN and can be installed by
`install.packages()`.
### Additional Resources
* <a href="https://github.com/NEONScience/NEON-Utilities/neonUtilities" target="_blank">GitHub repository for neonUtilities</a>
</div>
## Download Files and Load Directly to R: loadByProduct()
The most popular function in `neonUtilities` is `loadByProduct()`.
This function downloads data from the NEON API, merges the site-by-month
files, and loads the resulting data tables into the R environment,
assigning each data type to the appropriate R class. This is a popular
choice because it ensures you're always working with the most up-to-date data,
and it ends with ready-to-use tables in R. However, if you use it in
a workflow you run repeatedly, keep in mind it will re-download the
data every time.
Before we get the NEON data, we need to install (if not already done) and load
the neonUtilities R package, as well as other packages we will use in the
analysis.
```{r set-up-env, eval=F}
# Install neonUtilities package if you have not yet.
install.packages("neonUtilities")
install.packages("ggplot2")
```
```{r load-packages}
# Set global option to NOT convert all character variables to factors
options(stringsAsFactors=F)
# Load required packages
library(neonUtilities)
library(ggplot2)
```
The inputs to `loadByProduct()` control which data to download and how
to manage the processing. The following are frequently used inputs:
* `dpID`: the data product ID, e.g. DP1.20288.001
The `dpID` is the data product identifier of the data you want to
download. It will be in the form DP#.#####.###. For this tutorial, we'll use some data
products collected in NEON's Aquatic Instrument System:
DP1.20288.001: Water quality
DP1.20264.001: Temperature at specific depths in surface water
DP1.20033.001: Nitrate in surface water
* `site`: defaults to "all", meaning all sites with available data;
To download data from a specific site, use the 4-letter NEON site code.
The site codes for the seven NEON lake sites are:
BARC = Barco Lake FL
SUGG = Suggs Lake FL
CRAM = Crampton Lake WI
LIRO = Little Rock Lake WI
PRLA = Prairie Lake ND
PRPO = Prairie Pothole ND
TOOK = Toolik Lake AK
* `startdate` and `enddate`: defaults to NA, meaning all dates
with available data; or a date in the form YYYY-MM, e.g.
2017-06. Since NEON data are provided in month packages, finer
scale querying is not available. Both start and end date are
inclusive.
* `package`: either basic or expanded data package. Expanded data
packages generally include additional information about data
quality, such as individual quality flag test results. Not every
NEON data product has an expanded package; if the expanded package
is requested but there isn't one, the basic package will be
downloaded.
* `check.size`: T or F; should the function pause before downloading
data and warn you about the size of your download? Defaults to T; if
you are using this function within a script or batch process you
will want to set this to F.
* `token`: this allows you to input your NEON API token to obtain faster
downloads.
Learn more about NEON API tokens in the <a href="https//:www.neonscience.org/neon-api-tokens-tutorial" target="_blank">**Using an API Token when Accessing NEON Data with neonUtilities** tutorial</a>.
There are additional inputs you can learn about in the
<a href="https//:www.neonscience.org/neonDataStackR" target="_blank">**Use the neonUtilities R Package to Access NEON Data** tutorial</a>.
In this exercise, we will want data from only one NEON field site,
Little Rock Lake (LIRO) from August-October 2021 as the lake turns
over in the autumn.
Now let us download our data. If you are not using a NEON token to download
your data, neonUtilities will ignore the `token` input. Since we are only
downloading three months of data, we can set `check.size = F'. But for
large data downloads you may want to check your file size first.
```{r download-data-waq, results='hide'}
# download data of interest - Water Quality
waq <- loadByProduct(dpID="DP1.20288.001", site="LIRO",
startdate="2021-08", enddate="2021-10",
package="expanded",
token = Sys.getenv("NEON_TOKEN"),
check.size = F)
```
Using what you've learned above, can you modify the code to download data for:
DP1.20264.001 - Temperature at specific depths
DP1.20033.001 - Nitrate in surface water
```{r download-data-nsw, results='hide'}
# download data of interest - Temperature at specific depths
tsd <- loadByProduct(dpID="DP1.20264.001", site="LIRO",
startdate="2021-08", enddate="2021-10",
package="expanded",
token = Sys.getenv("NEON_TOKEN"),
check.size = F)
nsw <- loadByProduct(dpID="DP1.20033.001", site="LIRO",
startdate="2021-08", enddate="2021-10",
package="expanded",
token = Sys.getenv("NEON_TOKEN"),
check.size = F)
```
## Files Associated with Downloads
The data we've downloaded comes as an object that is a named list of objects.
To work with each of them, select them from the list using the `$` operator.
```{r loadBy-list-names}
# view all components of the list
names(waq)
```
We can see that there are multiple objects in the downloaded water quality data. One
dataframe of data (`waq_instantaneous`) and six metadata files. Now lets view
'waq_instantaneous'.
```{r loadBy-list-view}
# View the dataFrame
View(waq$waq_instantaneous)
```
If you'd like you can use the `$` operator to assign an object from an item in
the list. If you prefer to extract each table from the list and work with it as
independent objects, which we will do, you can use the `list2env()` function.
```{r unlist-vars}
# unlist the variables and add to the global environment
list2env(waq, .GlobalEnv)
```
So what exactly are these files and why would you want to use them?
* **data file(s)**: There will always be one or more dataframes that include the
primary data of the data product you downloaded. Multiple dataframes are
available when there are related datatables for a single data product.
* **readme_xxxxx**: The readme file, with the corresponding 5 digits from the
data product number, provides you with important information relevant to the
data product.
* **sensor_postions_xxxxx**: this file contains information about the coordinates
of each sensor, relative to a reference location.
* **variables_xxxxx**: this file contains all the variables found in the
associated data table(s). This includes full definitions, units, and other
important information.
* **issueLog_xxxxx**: this file contains information about issues identified
for the particular data product.
There are also two files specific to water quality:
* **ais_maintenance**: this file contains information about field maintenance,
for example when cleanings and calibrations were performed.
* **ais_multisondeCleanCal**: this file contains information from the cleanings
and claibrations, for example what the sensor was reading in standards before
and after. This can be useful in performing drift correction or adjusting
calibration offsets.
Let's perform the same thing for the temperature at specific depths and
nitrate in surface water data products too:
```{r unlist-remainder}
list2env(tsd, .GlobalEnv)
list2env(nsw, .GlobalEnv)
```
Note that a few more objects were added to the Global Environment, including:
* `TSD_1_min`
* `TSD_30_min`
* `NSW_15_minute`
The `30_min` in the name indicates the different time-averaging intervals in
a dataset. If only one time average interests you, you may specify the time
interval of interest using the `avg`input when downloading the data using
`neonUtilities::loadByProduct()`.
Note that in addition to the data files, these data products also include
separate `readme`, `sensor_position` and `variables` files.
## A note about sensor positions
NEON often collects the same type of data from sensors in different locations.
These data are delivered together but you will frequently want to plot the data
separately or only include data from one sensor in your analysis. NEON uses the
`horizontalPosition` variable in the data tables to describe which sensor
data is collected from. The `horizontalPosition` is always a three digit number
for AIS data. Non-shoreline HOR examples as of 2022 at AIS sites include:
* 101: stream sensors located at the **upstream** station on a **monopod mount**,
* 111: stream sensors located at the **upstream** station on an **overhead cable mount**,
* 131: stream sensors located at the **upstream** station on a **stand alone pressure transducer mount**,
* 102: stream sensors located at the **downstream** station on a monopod mount,
* 112: stream sensors located at the **downstream** station on an **overhead cable mount**
* 132: stream sensors located at the **downstream** station on a **stand alone pressure transducer mount**,
* 110: **pressure transducers** mounted to a **staff gauge**.
* 103: sensors mounted on **buoys in lakes or rivers**
* 130 and 140: sensors mounted in the **littoral zone** of lakes
Let's see what horizontalPositions are found in the data we just downloaded.
```{r unique-HOR}
# Determine what unique horizontalPositions are in dataset
print("Water quality horizontal positions:")
unique(waq_instantaneous$horizontalPosition)
```
Out site, LIRO, is a lake site, where water quality is measured from a buoy
moored in the deepest part.
## Plot dissolved oxygen data
First, let's identify the column names important for plotting - time and
dissolved oxygen data. (We can do this for just the original dataframe,
since all of the subsetted dataframes will have the same column names).
There are several ways of doing this:
```{r column-names}
# One option is to view column names in the data frame
colnames(waq_instantaneous)
# Alternatively, view the variables file
View(variables_20288)
```
Quite a few columns in the water quality data product!
The time column we'll consider for instrumented systems is `startDateTime`
because it approximately represents data within the interval on or before the
`startDateTime` time stamp. Timestamp column choice matters for time-aggregated
datasets, but should not matter for instantaneous data such as water quality.
When interpreting data, keep in mind NEON timestamps are always in UTC.
The data column we would like to plot is labeled `dissolvedOxygen`.
The deeper NEON lakes (CRAM, LIRO and TOOK) are equipped with profiling
multisondes which collect vertical profiles of the water column every four
hours. A pressure transducer is used to calculate the `sensorDepth` of each
measurement. For plotting, we can bin these into 1 meter increments.
```{r site-timezone-info}
# Just a little snippet if anyone wants to convert to local time
# NEON data downloads are all in UTC timezone, unless otherwise specified
devtools::install_github(repo = "NEONScience/NEON-geolocation",
subdir = "geoNEON")
siteLocInfo <- geoNEON::getLocBySite("LIRO",
type = "site",
history = F)
localTZ <- siteLocInfo$siteTimezone
waq_instantaneous$localStartTime <- format(waq_instantaneous$startDateTime,"%Y-%m-%dT%H:%M",tz = localTZ)
```
```{r plot-waq}
# create discrete color bins by rounding to the nearest meter
waq_instantaneous$sensorDepthRounded<-as.factor(floor(waq_instantaneous$sensorDepth))
# plot
plot_DO <- ggplot()+
geom_point(data=waq_instantaneous,aes(x=startDateTime,y=dissolvedOxygen,color=sensorDepthRounded),na.rm=TRUE) +
ylim(0, 9) + ylab("DO (mg/L)") + xlab(" ") +
ggtitle("Little Rock Lake - Dissolved Oxygen")
plot_DO
```
## Plot water temperature data
Now let's try plotting temperature. Because the temperature chain is static,
each of the thermistors has their own horizontalPosition with a known depth.
These can be found both in the data table, and the sensor_positions file.
```{r view-position}
# View sensor positions file
View(sensor_positions_20264)
```
Now we can plot temperature in a similar manner to dissolved oxygen.
```{r plot-temp}
# plot
plot_Temp <- ggplot()+
geom_line(data=TSD_30_min,aes(x=endDateTime,y=tsdWaterTempMean,color=verticalPosition),na.rm=TRUE) +
ylim(9, 27) + ylab("Temp (C)") + xlab(" ") +
ggtitle("Little Rock Lake - Temperature")
plot_Temp
```
## Plot nitrate data with uncertainty
Most NEON data products are published with uncertainty values. Lets try plotting
the nitrate in surface water data with uncertainty bands added.
```{r plot-uncert}
# plot
plot_nsw <- ggplot() +
geom_line(data = NSW_15_minute,
aes(startDateTime, surfWaterNitrateMean),
na.rm=TRUE, color="black") +
geom_ribbon(data=NSW_15_minute,
aes(x=startDateTime,
ymin = (surfWaterNitrateMean-surfWaterNitrateExpUncert),
ymax = (surfWaterNitrateMean+surfWaterNitrateExpUncert)),
alpha = 0.4, fill = "pink") +
geom_line( na.rm = TRUE) +
ylim(1, 5) + ylab("NO3-N (uM)") +
xlab(" ") +
ggtitle("Little Rock Lake - Nitrate w Uncertainty")
plot_nsw
```
## Examine Quality Flagged Data
Data product quality flags fall under two distinct types:
* Automated quality flags, e.g. range, spike, step, null
* Manual science review quality flag
In instantaneous data such as water quality DP1.20288.001,
the quality flag columns are denoted with "QF".
In time-averaged data, most quality flags have been aggregated into
quality metrics, with column names denoted with "QM" representing
the fraction of flagged points within the time averaging window.
```{r view-qf}
waq_qf_names <- names(waq_instantaneous)[grep("QF", names(waq_instantaneous))]
print(paste0("Total columns in DP1.20288.001 expanded package = ",
as.character(length(waq_qf_names))))
# so let's just look at those corresponding to dissolved oxygen
print("dissolved oxygen columns in DP1.20288.001 expanded package:")
print(waq_qf_names[grep("dissolvedOxygen", waq_qf_names)])
```
A quality flag (QF) of 0 indicates a pass, 1 indicates a fail, and -1 indicates
a test that could not be performed. For example, a range test cannot be
performed on missing measurements.
Detailed quality flags test results are all available in the
`package = 'expanded'` setting we specified when calling
`neonUtilities::loadByProduct()`. If we had specified `package = 'basic'`,
we wouldn't be able to investigate the detail in the type of data flag thrown.
We would only see the FinalQF columns.
The `AlphaQF` and `BetaQF` represent aggregated results of various QF tests,
and vary by a data product's algorithm. In most cases, an observation's
`AlphaQF = 1` indicates whether or not at least one QF was set to a value
of 1, and an observation's `BetaQF = 1` indicates whether or not at least one
QF was set to value of -1.
Let's consider what values were returned for each test.
```{r view-qf-do}
# so let us just look at those corresponding to dissolved
do_qf_names <- waq_qf_names[grep("dissolvedOxygen",waq_qf_names)]
for(col_nam in do_qf_names){
print(paste0(col_nam, " unique values: ",
paste0(unique(waq_instantaneous[,col_nam]),
collapse = ", ")))
}
```
Now let's consider the total number of flags generated for each quality test:
```{r dig-into-do, echo=TRUE}
# Loop across the do QF column names.
# Within each column, count the number of rows that equal '1'.
print("FLAG TEST - COUNT")
for (col_nam in do_qf_names){
totl_qf_in_col <- length(which(waq_instantaneous[,col_nam] == 1))
print(paste0(col_nam,": ",totl_qf_in_col))
}
print(paste0("Total observations: ", nrow(waq_instantaneous) ))
print(paste0("Percent flagged: ", sum(waq_instantaneous$dissolvedOxygenFinalQF)/nrow(waq_instantaneous) ))
print(paste0("Percent non-null flagged: ", (sum(waq_instantaneous$dissolvedOxygenFinalQF)-sum(waq_instantaneous$dissolvedOxygenNullQF))/(nrow(waq_instantaneous)-sum(waq_instantaneous$dissolvedOxygenNullQF))))
```
For specific details on the algorithms used to create a data product and its
corresponding quality tests, it's best to first check the data product's
Algorithm Theoretical Basis Document (ATBD). For water quality, that is
NEON.DOC.004931 listed as Documentation references in the README file and the
data product's web page.
If there any manual science review quality flags, the explanation for
flagging may also be viewed in the data product's README file or in the
data product's web page on NEON's data portal.
## Filtering (Some) Quality Flagged Observations
A simple approach to removing quality flagged observations is to remove data
when the finalQF is raised. Let's view dissolved oxygen in the context of its
final quality flags:
```{r plot-finalQF}
# sets the QF as a a factor rather than a continuous variable
waq_instantaneous$dissolvedOxygenFinalQF<-as.factor(waq_instantaneous$dissolvedOxygenFinalQF)
# plot
plot_DO <- ggplot()+
geom_point(data=waq_instantaneous,aes(x=startDateTime,y=dissolvedOxygen,color=dissolvedOxygenFinalQF),na.rm=TRUE) +
scale_color_manual(values = c("0" = "blue","1"="red")) +
ylim(0, 9) + ylab("DO (mg/L)") + xlab(" ") +
ggtitle("Little Rock Lake - Dissolved Oxygen by FinalQF")
plot_DO
```
The blue points corresponding to `dissolvedOxygenFinalQF = 0` represents
all data that were not flagged. Conversely, the red points corresponding
to `dissolvedOxygenFinalQF = 1` represents all data that were flagged,
What's going on here? It looks like all of the measurements collected
during the vertical profiles while the lake is stratified are getting
flagged. Let's create a similar plot, but using only the stepQF:
```{r plot-stepQF}
# sets the QF as a a factor rather than a continuous variable
waq_instantaneous$dissolvedOxygenStepQF<-as.factor(waq_instantaneous$dissolvedOxygenStepQF)
# plot
plot_DO <- ggplot()+
geom_point(data=waq_instantaneous,aes(x=startDateTime,y=dissolvedOxygen,color=dissolvedOxygenStepQF),na.rm=TRUE) +
scale_color_manual(values = c("0" = "blue","1"="red")) +
ylim(0, 9) + ylab("DO (mg/L)") + xlab(" ") +
ggtitle("Little Rock Lake - Dissolved Oxygen by StepQF")
plot_DO
```
Because there is a large change from one measurement to the next during
profiling, these measurements are getting an automated stepQF. In this
case however, we believe these differences to be real and not the result
of an erroneous measurement. This is a good example of why you should
not rely solely on the automated quality flags, especially the finalQF
in evaluating the data. We encourage users to download the expanded
data package (which contains individual QF) as well as reading the
documentation to understand how QFs are derived, before determining for
yourselves which data points to include or omit in your analysis.
```{r bathymetry}
# Code snippet for getting bathymetry data, https://data.neonscience.org/data-products/DP4.00132.001
bathymetryList <- neonUtilities::loadByProduct(dpID = "DP4.00132.001",
site = "LIRO",
startdate = "2020-01",
check.size = FALSE)
# This downloads files to a directory on your computer
# example path workingFolder <- "~/GitHub/WORKSHOP-GLEON-NEON-2022"
workingFolder <- "YOUR PATH HERE"
bathymetryData <- neonUtilities::zipsByURI(filepath = bathymetryList,
savepath = workingFolder,
check.size = FALSE)
# Read in most recent bathymetry metrics table
bathyVolMetrics <- read.csv(paste0(workingFolder,"/NEON_D05_LIRO_20200730_BATHYMETRY_L4_VB/D05_LIRO_BATH_20200730_volume.csv"))
surfaceAreaMetersSq <- bathyVolMetrics$X2d_area[bathyVolMetrics$Depth == 0]
bathyDepthMetrics <- read.csv(paste0(workingFolder,"/NEON_D05_LIRO_20200730_BATHYMETRY_L4_VB/D05_LIRO_BATH_20200730_tracks.csv"))
maxDepth <- min(bathyDepthMetrics$BottomElevation_m, na.rm = TRUE)
```