Commit

finish modeling-01

btupper committed Oct 7, 2023
1 parent 28dabeb commit f52baa0
Showing 9 changed files with 331 additions and 16 deletions.
219 changes: 206 additions & 13 deletions docs/modeling-01.html

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/search.json
Generated search index; its text mirrors the changes to modeling-01.qmd shown below.
126 changes: 124 additions & 2 deletions modeling-01.qmd
@@ -109,8 +109,9 @@ Now we can make predictions with our basic model. We'll do it two ways. First by
Here we provide a data frame, in our case the original input data, to the `predict()` function with type `cloglog`, which transforms the response values into the 0-1 range.

```{r}
#| width: "100%"
prediction = predict(model, input_table, type = 'cloglog')
summary(prediction)
hist(prediction, xlab = "prediction", main = "Basic Model")
```
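The `cloglog` transform itself is easy to sketch in base R. This is illustrative only: maxnet's `cloglog` prediction also folds in a fitted entropy constant, which we lump into a hypothetical link score `eta` here.

```{r}
# Complementary log-log transform: maps any real-valued
# link score onto the (0, 1) scale, as requested by
# type = 'cloglog' above.
cloglog = function(eta) {
  1 - exp(-exp(eta))
}

cloglog(c(-3, 0, 3))
```

The transform is monotone, so it rescales the predictions without changing their ranking (and therefore does not change the AUC computed below).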

#### How did it do?
@@ -121,16 +122,137 @@ We can use some utilities in the [maxnetic](https://github.com/BigelowLab/maxnet
x = dplyr::tibble(label = input_vector, pred = as.vector(prediction))
plot_ROC(x, title = "v1.0 Basic Model")
```
Overall, this is tellign us that the model isn't especially strong as a prediction tool, but it is much better than a 50-50 guess (that's when AUC is close to 0.5, and the curve follows the light grey line). Learn more about ROC and AUC [here](https://rviews.rstudio.com/2019/01/17/roc-curves/).
Overall, this is telling us that the model isn't especially strong as a prediction tool, but it is much better than a 50-50 guess (that's when AUC is close to 0.5, and the curve follows the light grey line). Learn more about ROC and AUC [here](https://rviews.rstudio.com/2019/01/17/roc-curves/).
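AUC has a useful interpretation: it is the probability that a randomly chosen presence scores higher than a randomly chosen background point. A minimal base-R sketch of that rank-based computation follows; it is not maxnetic's implementation, just the idea, with column names mirroring the `label`/`pred` table above.

```{r}
# AUC via the rank-sum (Mann-Whitney) identity: the chance that
# a presence (label 1) outranks a background point (label 0).
auc = function(label, pred) {
  r  = rank(pred)          # ranks; ties are averaged
  n1 = sum(label == 1)     # number of presences
  n0 = sum(label == 0)     # number of background points
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy example: presences tend to score higher than background
set.seed(42)
label = c(rep(1, 100), rep(0, 100))
pred  = c(runif(100, 0.3, 1), runif(100, 0, 0.7))
auc(label, pred)
```

An AUC of 0.5 is the 50-50 guess mentioned above; 1.0 would be perfect separation of presences from background.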


### Predict with rasters

We can also predict with raster inputs using our basic model. Let's read in rasters for each month of 2019, and then run a prediction for each month.

```{r}
dates = as.Date(c("2019-01-01", "2019-12-31"))
sst_path = "data/oisst"
sst_db = oisster::read_database(sst_path) |>
dplyr::arrange(date) |>
dplyr::filter(dplyr::between(date, dates[1], dates[2]))
sst = sst_db |>
oisster::compose_filename(path = sst_path) |>
stars::read_stars(along = list(time = sst_db$date)) |>
rlang::set_names("sst")|>
st_to_180()
wind_path = "data/nbs"
wind_db = nbs::read_database(wind_path) |>
dplyr::arrange(date)|>
dplyr::filter(dplyr::between(date, dates[1], dates[2]))
u_wind_db = wind_db |>
dplyr::filter(param == "u_wind")|>
dplyr::filter(dplyr::between(date, dates[1], dates[2]))
u_wind = u_wind_db |>
nbs::compose_filename(path = wind_path) |>
stars::read_stars(along = list(time = u_wind_db$date)) |>
rlang::set_names("u_wind") |>
st_to_180()
v_wind_db = wind_db |>
dplyr::filter(param == "v_wind")|>
dplyr::filter(dplyr::between(date, dates[1], dates[2]))
v_wind = v_wind_db |>
nbs::compose_filename(path = wind_path) |>
stars::read_stars(along = list(time = v_wind_db$date)) |>
rlang::set_names("v_wind") |>
st_to_180()
```


Once we have them in hand we need to bind them together. But first we need to attend to a common but important issue. The `sst` rasters and `windspeed` rasters have different extents. We can't bind them together until we warp one set to match the other. Let's warp `sst` to match `u_wind`. And then we can bind them together.

```{r}
sst_warped = stars::st_warp(sst, u_wind)
x = list(sst_warped, u_wind, v_wind)
predictors = do.call(c, append(x, list(along = NA_integer_)))
predictors
```

Now we can run the prediction.

```{r}
pred = predict(model, predictors, type = 'cloglog')
pred
```

Since we get a spatially mapped prediction back, we can plot it.

```{r}
coast = rnaturalearth::ne_coastline(scale = 'large', returnclass = 'sf') |>
sf::st_crop(pred)
plot_coast = function() {
plot(sf::st_geometry(coast), col = 'green', add = TRUE)
}
plot(pred, hook = plot_coast)
```

Well, that certainly looks appealing with higher likelihood of near shore observations occurring during the warmer months.

#### How did it do?

To compute an ROC and AUC for each month, we have a little bit of work to do. We need to extract the observations and background for each month from the prediction maps. These we can then pass to the `plot_ROC()` function.

:::{.callout-note}
We have to modify the date for each point to be the first date of each month. That's because our predictors are monthlies.
:::
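Flooring a date to the first of its month is simple in base R; this sketch is a stand-in for what we assume `oisster::current_month()` does.

```{r}
# Floor each Date to the first day of its month by
# reformatting with the day fixed at "01".
first_of_month = function(x) {
  as.Date(format(x, "%Y-%m-01"))
}

first_of_month(as.Date(c("2019-05-14", "2019-12-31")))
# "2019-05-01" "2019-12-01"
```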

```{r}
test_obs = obs |>
dplyr::filter(dplyr::between(date, dates[1], dates[2])) |>
dplyr::select(dplyr::all_of("date")) |>
dplyr::mutate(date = oisster::current_month(date))
test_bkg = bkg |>
dplyr::filter(dplyr::between(date, dates[1], dates[2])) |>
dplyr::select(dplyr::all_of("date")) |>
dplyr::mutate(date = oisster::current_month(date))
test_input = dplyr::bind_rows(test_obs, test_bkg)
x = stars::st_extract(pred, test_input, time_column = 'date') |>
print()
```

Finally we can build a table that merges the prediction with the labels. We are going to add the name of the month to group by that.

```{r}
y = x |>
dplyr::mutate(label = c(rep(1, nrow(test_obs)), rep(0, nrow(test_bkg))),
month = factor(format(date, "%b"), levels = month.abb),
.before = 2) |>
sf::st_drop_geometry() |>
dplyr::select(dplyr::all_of(c("month", "label", "pred"))) |>
dplyr::group_by(month)
dplyr::count(y, month, label) |>
print(n = 24)
```

Now how about one ROC plot for each month? Yikes! This requires an iterative approach, using `group_map()`, to compute the ROC for each month. We then wrap the plots together using the [patchwork](https://patchwork.data-imaginist.com/articles/guides/assembly.html#functional-assembly) package.

```{r}
#| width: "100%"
rocs = dplyr::group_map(y,
function(tbl, key){
maxnetic::plot_ROC(tbl, title = sprintf("%s, n = %i", key$month, nrow(tbl)),
xlab = "", ylab = "")
})
patchwork::wrap_plots(rocs, ncol = 4)
```

Hmmm. That's surprising, yes? Why does our AUC go down during the summer months? In fact, at times we are predicting the likelihood of **not** having an observation reported. It's hard to know what to think, but consider that we are using a model generated across all months of multiple years, and it might not predict a particular month and year very well. As a step toward refinement, our next step is to make 12 models, one for each month.
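The shape of that refinement is a split-apply loop: split the input on month, fit within each group, and keep the twelve fits in a named list. In this sketch a plain `glm()` on made-up data stands in for `maxnet::maxnet()` on the real input.

```{r}
# One model per month: split on month, then fit the same model
# within each group. glm() is a stand-in for maxnet::maxnet().
set.seed(1)
dat = data.frame(
  label = rbinom(240, 1, 0.5),
  sst   = runif(240, 5, 25),
  month = factor(rep(month.abb, each = 20), levels = month.abb)
)

models = lapply(split(dat, dat$month), function(tbl) {
  glm(label ~ sst, data = tbl, family = binomial())
})

names(models)   # "Jan" "Feb" ... "Dec"
```

Predicting a given month's raster then means looking up that month's model in the list, rather than using the single all-months model.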



0 comments on commit f52baa0
