forked from NeotomaDB/EPD_binder
-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathcomplex_workflow.Rmd
533 lines (371 loc) · 27.8 KB
/
complex_workflow.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
---
title: "A Not so Simple Workflow - SDM Version"
author:
- name: "Nora Schlenker"
institute: [uwiscgeog]
correspondence: false
email: [email protected]
orcid_id: 0000-0002-3693-5946
- name: "Simon Goring"
institute: [uwiscgeog,uwiscdsi]
correspondence: true
email: [email protected]
orcid_id: 0000-0002-2700-4605
- name: "Socorro Dominguez Vidaña"
institute: [rhtdata]
correspondence: false
email: [email protected]
orcid_id: 0000-0002-7926-4935
institute:
- htdata:
name: "HT Data"
- uwiscgeog:
name: "University of Wisconsin -- Madison: Department of Geography"
- uwiscdsi:
name: "University of Wisconsin -- Data Science Institute"
date: "`r Sys.Date()`"
output:
html_document:
code_folding: show
fig_caption: yes
keep_md: yes
self_contained: yes
theme: readable
toc: yes
toc_float: yes
css: "text.css"
pdf_document:
pandoc_args: "-V geometry:vmargin=1in -V geometry:hmargin=1in"
dev: svg
highlight: tango
csl: 'https://bit.ly/3khj0ZL'
---
## Building Species Distribution Models from Pollen Data
This RMarkdown document will walk you through the process of:
1. Downloading pollen records from multiple sites -- [Loading Datasets](#loading-datasets)
2. Filtering for specific taxa and taxonomic harmonization -- [Data Filtering](#data-filtering)
3. Filtering and binning for specific time periods
4. Linking to environmental data
5. Performing simple SDMs for different time periods
This work-through is the companion to the Introduction to Neotoma [presentation](https://docs.google.com/presentation/d/1Fwp5yMAvIdgYpiC04xhgV7OQZ-olZIcUiGnLfyLPUt4/edit?usp=sharing).
## Disciplinary Knowledge
Before we start any analysis, we need to recognize that paleoecological data has some important considerations that need to be taken into account. Our "complex" workflow has been simplified to some degree, but, were we to work towards published results, these concepts need to be considered in our work.
### Taxonomies and Morphotaxa
Pollen "taxa" are identified by the structures of the pollen grains themselves, identified on microscope slides. The level of taxonomic resolution (e.g. species, genera, family) in pollen data varies depending on a number of factors, and, in some cases, does not directly align with standard taxonomies (such as the GBIF Backbone). These taxonomic issues generally fall along several main topics:
* **Taxonomic Resolution**: Many pollen taxa can only be identify to family or genus level due to similarity of pollen at finer taxonomic levels. An additional consideration is that although some pollen that can only be identify to a genus level may be able to be classified to a finer scale based if it is the only species within that genus within a geographic area, however, the pollen may only be identified to the genus level.
* **Analyst Experience**: The technology and skill available to the pollen counter can vary, both as a result of the overall experience and quality of equipment, but also based on where and what kinds of pollen they have identified in the past. An expert in western North American pollen taxa may have skill in this region, but might identify pollen to a lower taxonomic resolution when identifying pollen from Australia.
* **Naming conventions**: Just as analysts have differences in identification, the ways taxa are named varies by analyst and over time. For example, "Poaceae" and "Poaceae undiff." generally refer to the same pollen type but are recorded as different taxa within the database.
* **Uncertainty**: Neotoma uses a [well-documented system for indicating uncertainty](https://open.neotomadb.org/manual/taxonomy-related-tables-1.html#Taxa) in identification. Depending on the research question, we may accept more or less taxonomic uncertainty in our identification.
Due to this, a required step of working with pollen data is to harmonize taxon across sites. The level of resolution of your final taxon harmonization will depend on your research question and location of study.
### Chronologies and Ages
Chronologies underpin all of our paleoenvironmental analysis and interpretation. Neotoma's data model recognizes the importance of chronologies by providing a data model that supports multiple chronologies per record, and allows new analysts to add their own chronologies to existing records.
Some records in Neotoma are from the 1980s or earlier, and these records may have weak chronologies, relative to modern chronologies. They are built using linear interpolation, the radiocarbon dates are uncalibrated, and often, there is no uncertainty in the age predictions.
Our understanding of the relationship between ^14^C production in the upper atmosphere and its incorporation into organic material has improved over time, resulting in updates to the INTCAL curve. In addition, our chronology construction models have improved, fom simple linear interpolation to more complex Baysian models like BACON.
For this reason, researchers will often undertake chronology re-construction early on in their analysis to ensure there is a standardized approach to building the models.
### Age Uncertainty
Although each pollen sample is associated with an age (based on the age-depth model/chronology) it is also associated with uncertainty around that age. The amount of uncertainty varies depending on the number and type of dated material (i.e. radiocarbon from macrofossils vs. bulk carbon), the distance of a sample from dated material along the core, and the quality of the overall chronology. Age uncertainty means that pollen samples from two sites with the same estimated age do not neccessarily represent the same moment in time.
As statistical modelling has improved in the sciences, we have seen more rigorous methods for managing age uncertainty in analysis.
## Load Libraries
For this workshop we only need a few packages. We will download multiple records from Neotoma, filter by taxa and time period, and preform simple species distribution models.
We'll be using the R package `pacman` here, to automatically load and install packages. Note that we also use the package rpaleoclim to download the climate data, but since we have pre-downloaded the climate rasters for you, you will not need that package today.
```{r setup}
pacman::p_load(neotoma2, tidyverse, remotes, terra, sf, rpaleoclim)
```
## Loading Datasets
We worked through the process for finding and downloading records using `neotoma2` in the [previous workshop](https://open.neotomadb.org/Current_Workshop/simple_workflow.html). For this exercise we will be pulling European pollen records from Neotoma to build species distribution models (SDMs) for taxa at multiple time periods in the past.
To undertake this analysis we need to:
* Define a region of interest
* Download full records (sites, datasets & samples)
* Filter records for data quality (presence of chronologies, taxa, etc.)
### Define Spatial Domain {.tabset}
There are a large number of records in Europe. Downloading this volume of data from Neotoma can take a long time (~2hrs total) and is computationally expensive for the database itself. We have already completed the data download saved it to our data folder. Under the results you can take a look at all the sites we downloaded. As with the previous example, we can use `loc` parameter along with [WKT](https://arthur-e.github.io/Wicket/sandbox-gmaps3.html) or [geoJSON](http://geojson.io/#map=2/20.0/0.0) objects, and native `sf` objects.
#### Code
```{r geteurope, message = FALSE, eval = TRUE}
europe <- '{"type": "Polygon",
"coordinates": [[
[ -32.13,66.46],
[-14.09,36.93],
[30.16,34.71],
[35.79,69.17],
[-32.13,66.46]]]}'
europe_sf <- geojsonsf::geojson_sf(europe)
## The following lines of code are commented out because
# we've already run it for you to speed up the process.
# Downloading large amounts of data can take a long time.
#
# europe_downloads <- neotoma2::get_datasets(loc = europe,
# datasettype = "pollen",
# all_data = TRUE) %>%
# neotoma2::filter(!is.na(age_range_young)) %>%
# get_downloads(all_data = TRUE)
# europe_samples <- samples(europe_downloads)
# saveRDS(europe_downloads, "data/europe_downloads.RDS")
# saveRDS(europe_samples, "data/europe_samples.RDS")
europe_downloads <- readRDS("data/europe_downloads.RDS")
europe_samples <- readRDS("data/europe_samples.RDS")
```
#### Sites Map
```{r geteuropeshow, eval=TRUE, echo = FALSE}
neotoma2::plotLeaflet(europe_downloads) %>%
leaflet::addPolygons(map = .,
data = europe_sf,
color = "green")
```
#### Data table
```{r downloadsCode, echo = FALSE}
head(europe_samples, n = 10) %>%
dplyr::select(age,
variablename,
ecologicalgroup,
value,
units,
sitename,
lat,
long) %>%
DT::datatable(data = .,
options = list(scrollX = "100%", dom = 't'))
```
## Data Filtering
### Sample Filtering
When we look at pollen data we are looking at counts of pollen data, from a slide, for a particular analysis unit. Along with pollen there can also be measurements of charcoal, records of algal cycts, and other lab-based measurements. In addition, we often look only at terrestrial pollen types, since they generally reflect regional vegetation, as opposed to the pollen of aquatic types that would represent local site conditions.
Even though the datasets are pollen datasets, the `europe_samples` `data.frame` contains data for a number of different measured elements, including things like mandibles, scales and sporangia. In addition, these elements are measured using `r length(unique(europe_samples$units))` different measurement units (`unique(europe_samples$units)`) across `r length(unique(europe_samples$ecologicalgroup))` ecological groups (`unique(europe_samples$ecologicalgroup)`) including diatoms, trees and shrubs, rotifers and insects.
#### Standard Units and Elements
We often use pollen proportions in our analysis, and for this we need to find out how many total pollen grains were counted in each sample. We also want to make sure we're looking at a common set of taxa across all sites. In this case we will filter for trees and shrubs (ecological group `TRSH`) and upland herbs (ecological group `UPHE`). We will also look for pollen (element type `pollen`) that is measured using the Number of Identified Specimens (units `NISP`).
This makes the filtering rather straightforward:
```{r cutsamples}
europe_cleaner <- europe_samples %>%
filter(units == 'NISP' &
elementtype == 'pollen' &
(ecologicalgroup %in% c('TRSH', 'UPHE')))
```
This initial filtering removes about `r round(nrow(europe_samples) - nrow(europe_cleaner), -3)` rows from our original dataset.
#### Standardizing Taxa
For the purposes of this analysis we will only be focusing on building models for *Corylus*, *Fagus* and *Picea*.
Before we can start looking at our focal species we need to do some work with our sample table so that we have all the information we want. We need to do three things at this step: filtering the data to only include our focal taxa, doing taxon harmonization on our focal taxa, and calculating the pollen percent for each focal taxa for each pollen sample.
*Note:* Our taxon harmonization here is very simple and more thought should be put into taxon harmonization.
Our first task is to make a table that we will join with our samples table that tells us the total number of pollen per sample which we will use in our pollen percentage calculations. There are many decisions you can make at this step depending on your research question but here we are going to calculate pollen percent based on the total number of trees, shrubs, and herbaceous groups (ecologicalgroup = TRSH and UPHE). This is common for may studies that focus on these ecological groups but know that you can choose other combinations and this will effect how you interpret your results.
Then we can combine taxon filtering and harmonization and pollen percent calculation all in one segment of code. Here we are selecting the taxa *Fagus*, *Picea*, and *Corylus* and harmonizing all variations of those taxa to the genus level. We can use the code below to look at the diversity of identifiers for these three taxa:
```r
europe_cleaner %>%
filter(stringr::str_detect(variablename, "Fagus|Corylus|Picea")) %>%
select(variablename) %>% distinct()
```
For our purposes it will be enough to simply standardize all taxa identified as some form of *Picea* to *Picea*, and the same for the *Corylus* and *Fagus* types. This then lets us right some fairly clean code as follows:
##### Code
```{r replacetaxa, warning = FALSE, message = FALSE}
europe_cleaner <- europe_cleaner %>%
mutate(variablename = replace(variablename,
stringr::str_detect(variablename, "Fagus"),
"Fagus"),
variablename = replace(variablename,
stringr::str_detect(variablename, "Picea"),
"Picea"),
variablename = replace(variablename,
stringr::str_detect(variablename, "Corylus"),
"Corylus")) %>%
group_by(datasetid, age, variablename, units, ecologicalgroup, sitename, lat, long) %>%
summarise(value = sum(value)) %>% as.data.frame()
```
We have to `summarise` at the end to make sure that counts of taxon name variants within a single sample get lumped together properly. Once we've run this code, we can test out the earlier code block to see if we are now down to three distinct taxa (and we ought to be).
**Note**: We're using `as.data.frame()` here to speed up operations below. Leaving this object as a `grouped_df` seem to seriously impact performance.
### Age Filtering
The age ranges for these records may be broad. Some European records go back over 100,000 years before present, and some samples (and datasets) appear to have no chronologies. We can check the ages using the columns `age`, `ageolder` and `ageyounger`. Note that many chronologies have been entered without uncertainty measures, in which case `ageolder` and `ageyounger` would be unreported. Otherwise these values represent ±1 standard deviation.
We will be building our SDMs for three critical time periods in the time since the last glaciation:
| time period | age range (ybp) | label |
| ----------- | --------------- | ----- |
| Modern | -76 - 150 | `Modern` |
| Younger Dryas | 11,700 - 12,900 | `YD` |
| Last Glacial Maximum | > 21,000 | `LGM` |
To prepare the data for analysis we will create a new column using `mutate`, based on our time bins, and filter out any records that are not within our defined periods of study:
#### Code
```{r agecleaning}
europe_ready <- europe_cleaner %>%
mutate(timebins = case_when(age >= -72 & age <= 150 ~ "Modern",
age >= 11700 & age <= 12900 ~ "YD",
age >= 20000 & age <= 22000 ~ "LGM",
.default = NA)) %>%
filter(!is.na(timebins)) %>%
as.data.frame()
```
###
Our data is now ready for further analysis. We've cleaned the taxonomy for our three focal taxa, ensured all data are of a similar type and measured with similar units, and we've made sure that our time bins are clearly defined.
*Note:* We do not attempt to update age-depth models and we filter out all samples that do not have age estimates when some of them may be able to be included under closer inspection. Additionally, we do not take into account time uncertainty which is another aspect that should be considered when filtering by time.
### Calculating Pollen Sums, Percentages, and Presence {.tabset}
From here we need to calculate the pollen sum for each sample in our set of samples. This means that we will get the total number of pollen grains counted for e.g., Kansjön at depth 445cm in the core. We can calculate both the pollen sum and the proportion at the same time by using `group_by()` and `ungroup()` to first sum the `value` fields (the counts) and then calculate the proportion from our new `totalpollen` field:
```{r}
europe_subset <- europe_ready %>%
group_by(datasetid, age, sitename, lat, long, timebins) %>%
mutate(totalpollen = sum(value)) %>%
ungroup() %>%
mutate(polprop = round(value / totalpollen, 2)) %>% #calculates pollen proportion of total pollen counted per sample
mutate(presence = case_when(polprop > .01 ~ 1,
.default = 0)) %>% #calculates presence of a taxa when it has >1% pollen proportion
as.data.frame()
```
#### Data Table
```{r filtertaxaShow, eval = TRUE, message = FALSE, echo = FALSE}
europe_subset[1:10,] %>%
DT::datatable(data = .,
options = list(scrollX = "100%", dom = 't'))
```
#### Samples by Time Period
```{r agecleaningshow, echo=FALSE, warning=FALSE, message=FALSE}
europe_ready %>%
select(variablename, timebins) %>%
filter(variablename %in% c("Fagus", "Corylus", "Picea")) %>%
mutate(timebins = factor(timebins, levels = c("Modern", "YD", "LGM")),
taxa = factor(variablename, levels = c("Fagus", "Corylus", "Picea"))) %>%
ggplot(aes(timebins)) + geom_bar(stat = "count") +
geom_text(stat='count', aes(label=after_stat(count)), vjust=-.5) +
facet_wrap(~taxa, ncol = 3) +
ylab("Count of Samples") + xlab("Time Bins") + theme_minimal()
```
###
## Loading Climate Data {.tabset}
There is no climate data associated with pollen data downloaded from Neotoma but you can utilize climate data that is available online to intersect with the locations of pollen samples. For this we will be using climate data from [Paleoclim](http://www.paleoclim.org/) which compiles climate data from present to the Pliocene from various sources. We will be using their modern and LGM climate data from [CHELSA](https://chelsa-climate.org/) and Younger Dryas climate data from [Fordham et al., 2017](https://nsojournals.onlinelibrary.wiley.com/doi/full/10.1111/ecog.03031).
### Code
```{r downloadclims, message = FALSE, eval = TRUE}
# europe_clim_mod <- paleoclim("cur", "5m", region = terra::ext(europe_sf))
# europe_clim_yd <- paleoclim("yds", "5m", region = terra::ext(europe_sf))
# europe_clim_lgm <- paleoclim("lgm", "5m", region = terra::ext(europe_sf))
#
# saveRDS(europe_clim_mod, "data/europe_clim_mod.RDS")
# saveRDS(europe_clim_yd, "data/europe_clim_yd.RDS")
# saveRDS(europe_clim_lgm, "data/europe_clim_lgm.RDS")
europe_clim_mod <- readRDS("data/europe_clim_mod.RDS")
europe_clim_yd <- readRDS("data/europe_clim_yd.RDS")
europe_clim_lgm <- readRDS("data/europe_clim_lgm.RDS")
```
### Modern
```{r downloadclimsShow, echo = FALSE, message = FALSE, eval = TRUE}
plot(europe_clim_mod[["bio_1"]]/10, main = "Modern Mean Annual Temp, Europe")
```
### Younger Dryas
```{r downloadclimsShow2, echo = FALSE, message = FALSE, eval = TRUE}
plot(europe_clim_yd[["bio_1"]]/10, main = "Younger Dryas Mean Annual Temp, Europe")
```
### Last Glacial Maximum
```{r downloadclimsShow3, echo = FALSE, message = FALSE, eval = TRUE}
plot(europe_clim_lgm[["bio_1"]]/10, main = "LGM Mean Annual Temp, Europe")
```
## Extracting Environmental Variables {.tabset}
Now that we have cleaned taxa data and have loaded in climate data we can extract environmental variables for each pollen sample based on the lat/long of the collection unit. Below is a function that will extract the [19 biocliamte variables](https://chelsa-climate.org/bioclim/) from the climate rasters.
### Extracting
```{r extractclims, message = FALSE, eval = TRUE}
#First lets save the projection information from our climate data to use later when creating our species points data
proj_data <- crs(europe_clim_mod)
extract_clims <- function(taxa_data, clim_data, time_text, taxa_text, projection = proj_data) {
time_cut <- dplyr::filter(taxa_data, timebins == time_text)
time_sf <- st_as_sf(time_cut, coords = c("long", "lat"), crs = projection)
time_sf_vect <- vect(time_sf$geometry)
time_sf_clims <- terra::extract(clim_data, time_sf_vect)
time_sf_full <- cbind(time_sf, time_sf_clims)
dplyr::filter(time_sf_full, variablename == taxa_text)
}
## Fagus
fagus_mod <- extract_clims(taxa_data = europe_subset, clim_data = europe_clim_mod,
time_text = "Modern", taxa_text = "Fagus", projection = proj_data)
fagus_yd <- extract_clims(taxa_data = europe_subset, clim_data = europe_clim_yd,
time_text = "YD", taxa_text = "Fagus", projection = proj_data)
fagus_lgm <- extract_clims(taxa_data = europe_subset, clim_data = europe_clim_lgm,
time_text = "LGM", taxa_text = "Fagus", projection = proj_data)
```
### Fagus E-Space 1D
```{r extractclimsshow1, echo = FALSE, message = FALSE, eval = TRUE, warning = FALSE}
rbind(fagus_mod, fagus_yd, fagus_lgm) %>%
select(timebins, polprop, bio_1, bio_12, bio_4, bio_15, geometry) %>%
pivot_longer(-c(timebins, polprop, geometry)) %>%
mutate(timebins_fact = factor(timebins, levels = c("Modern", "YD", "LGM"))) %>%
ggplot(aes(value, polprop)) + geom_point() +
facet_grid(timebins_fact~name, scales = "free", switch = "x") +
xlab("") +
ylab("Pollen Percent")+
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
### Fagus E-Space 2D
```{r extractclimsshow2, echo = FALSE, message = FALSE, eval = TRUE, warning = FALSE}
rbind(fagus_mod, fagus_yd, fagus_lgm) %>%
select(timebins, polprop, bio_1, bio_12, bio_4, bio_15, geometry) %>%
mutate(timebins_fact = factor(timebins, levels = c("Modern", "YD", "LGM"))) %>%
ggplot(aes(bio_1, bio_12, size = polprop)) + geom_point() +
facet_wrap(~timebins_fact, ncol = 3, scales = "fixed", switch = "x") +
xlab("Mean Annual Temperature") +
ylab("Annual Precipitation")+
theme_bw() +
theme(axis.text.x = element_text(angle = 45))
```
## Exercise 1 - Predicting Modern Presence {.tabset}
For our first exercise we will be exploring how pollen data can be used to predict suitable climates for taxa under current conditions. For this first example we will compare two models, one which uses pollen proportion as input to predict pollen abundance and another which uses the presence only data to predict likelihood of occurrence using a binomial distribution.
### Code
```{r exercise1, message = FALSE, eval = TRUE}
# Create a GLM using pollen proportion data using the gaussian distribution
fagus_mod_glm <- glm(polprop ~ bio_1 + bio_12, data = fagus_mod, family = gaussian)
fagus_mod_glm_predict <- predict(c(europe_clim_mod[["bio_1"]], europe_clim_mod[["bio_12"]]), fagus_mod_glm)
# Create a GLM using presence only data using the binomial distribution
fagus_mod_glm_pres <- glm(presence ~ bio_1 + bio_12, data = fagus_mod, family = binomial)
fagus_mod_glm_pres_predict <- predict(c(europe_clim_mod[["bio_1"]], europe_clim_mod[["bio_12"]]), fagus_mod_glm_pres)
```
### Proportion Model Fit
```{r excersise1show1, echo = FALSE, message = FALSE, eval = TRUE}
summary(fagus_mod_glm)
```
### Presence Model Fit
```{r excersise1show2, echo = FALSE, message = FALSE, eval = TRUE}
summary(fagus_mod_glm_pres)
```
### Maps
```{r excersise1show3, echo = FALSE, message = FALSE, eval = TRUE, out.width = "50%"}
plot(fagus_mod_glm_predict, main = "Predicted Fagus Pollen Percent, Modern")
plot(fagus_mod$geometry, add = TRUE)
plot(fagus_mod_glm_pres_predict, main = "Predicted Fagus Presence, Modern")
plot(fagus_mod$geometry[which(fagus_mod$presence == 1)], add = TRUE)
```
##
### Your Task
Using the example above create your own model of modern species presence either using one of our other taxa (*Corylus* or *Picea*), You can also experiment with using different combinations of environmental variables (we have all 19 bioclimatic variables) or even use a different model type!
When you are finished you can share your results by pasting a screenshot of a figure to our [presentation](https://docs.google.com/presentation/d/1Fwp5yMAvIdgYpiC04xhgV7OQZ-olZIcUiGnLfyLPUt4/edit?usp=sharing) on the Exercise 1 slides.
## Exercise 2 - Hindcasting Species Presence {.tabset}
Now lets take a look at how we can utilize the power of paleoecological data to create and/or validate distribution models. Here we will use the same model created above for Fagus presence during modern times and hindcast it using the climate data from the LGM. We will then create a new model using the pollon proportion data of pollen at the LGM and predict species presence at the LGM. We can then compare the two models. Although we are not doing this formally here you can also use taxa pollen presence in the past to formally validate models created with modern species-environment relationships.
### Code
```{r exercise2, message = FALSE, eval = TRUE}
# Create a GLM using pollen proportion data using the gaussian distribution
fagus_modtolgm_glm_predict <- predict(c(europe_clim_lgm[["bio_1"]], europe_clim_lgm[["bio_12"]]), fagus_mod_glm)
# Create a GLM using presence only data using the binomial distribution
fagus_lgm_glm <- glm(polprop ~ bio_1 + bio_12, data = fagus_lgm, family = gaussian)
fagus_lgmtolgm_glm_predict <- predict(c(europe_clim_lgm[["bio_1"]], europe_clim_lgm[["bio_12"]]), fagus_lgm_glm)
```
### LGM Model Fit
```{r excersise2show1, echo = FALSE, message = FALSE, eval = TRUE}
summary(fagus_lgm_glm)
```
### Maps
```{r excersise2show3, echo = FALSE, message = FALSE, eval = TRUE, out.width = "50%"}
plot(fagus_modtolgm_glm_predict, main = "Predicted Fagus Pollen Proportion, Modern to LGM")
plot(fagus_lgm$geometry, add = TRUE)
plot(fagus_lgmtolgm_glm_predict, main = "Predicted Fagus Pollen Proportion, LGM to LGM")
plot(fagus_lgm$geometry, add = TRUE)
```
##
### Your Task
Using the example above create your own model of pollen proportions of one of our other taxa (*Corylus* or *Picea*) in one of our past time periods (Younger Dryas and Last Glacial Maximum). You can also experiment with using different combinations of environmental variables (we have all 19 bioclimatic variables) or even use a different model type!
When you are finished you can share your results by pasting a screenshot of a figure to our [presentation](https://docs.google.com/presentation/d/1Fwp5yMAvIdgYpiC04xhgV7OQZ-olZIcUiGnLfyLPUt4/edit?usp=sharing) on the Exercise 2 slides.
## Exercise 3 - Choose your own adventure!
Now that we have walked through getting your data from the Neotoma Paleoecology Database, correlating pollen samples with environmental variables, and explored a few ways that you can use the data you can now experiment with out types of questions we can answer with this data:
* Do SDMs created for past climates accurately predict where species are today?
* Do pollen percent or presence of taxa occupy different portions of climate space through time?
* Many more!
Now it is your turn to come up with a question that can be answered by our data and create code to answer your question. When you are finished you can share your results by pasting a screenshot of a figure to our [presentation](https://docs.google.com/presentation/d/1Fwp5yMAvIdgYpiC04xhgV7OQZ-olZIcUiGnLfyLPUt4/edit?usp=sharing) on the Exercise 3 slides.
## Summary
From this notebook we have learned how to:
1. Downloading pollen records from multiple sites
2. Filtering for specific taxa and taxonomic harmonization
3. Filtering and binning for specific time periods
4. Linking to environmental data
5. Performing simple SDMs for different time periods
This approach is very simple workflow for taking records from many sites and combining them into species distribution models to compare species distributions at different periods of time.
### Other Helpful R Packages
These packages may be helpful to you for future paleoecological analysis:
* [RFossilpoll](https://hope-uib-bio.github.io/FOSSILPOL-website/) - package to help streamline the data downloading and filtering of paleoecological data from Neotoma and other resources
* [R-Ratepol](https://hope-uib-bio.github.io/R-Ratepol-package/) - package to calculate rate of change from community time series data
* [rpaleoclim](http://www.paleoclim.org/) - package to facilitate direct download of high resolution paleo-climate data