Commit

finish modeling-01

btupper committed Oct 7, 2023
1 parent 28dabeb commit f52baa0
Showing 9 changed files with 331 additions and 16 deletions.
219 changes: 206 additions & 13 deletions docs/modeling-01.html

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/search.json
Generated search index; its text mirrors the changes to modeling-01.qmd shown below.
126 changes: 124 additions & 2 deletions modeling-01.qmd
@@ -109,8 +109,9 @@ Now we can make predictions with our basic model. We'll do it two ways. First by
Here we provide a data frame, in our case the original input data, to the `predict()` function with type `cloglog`, which transforms the response values into the 0-1 range.

```{r}
#| width: "100%"
prediction = predict(model, input_table, type = 'cloglog')
summary(prediction)
hist(prediction, xlab = "prediction", main = "Basic Model")
```
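The `cloglog` transform itself is easy to sketch in base R. This is illustrative only: maxnet's `cloglog` prediction also folds in a fitted entropy constant, which we lump into a hypothetical link score `eta` here.

```{r}
# Complementary log-log transform: maps any real-valued
# link score onto the (0, 1) scale, as requested by
# type = 'cloglog' above.
cloglog = function(eta) {
  1 - exp(-exp(eta))
}

cloglog(c(-3, 0, 3))
```

The transform is monotone, so it rescales the predictions without changing their ranking (and therefore does not change the AUC computed below).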

#### How did it do?
@@ -121,16 +122,137 @@ We can use some utilities in the [maxnetic](https://github.com/BigelowLab/maxnet
x = dplyr::tibble(label = input_vector, pred = as.vector(prediction))
plot_ROC(x, title = "v1.0 Basic Model")
```
Overall, this is tellign us that the model isn't especially strong as a prediction tool, but it is much better than a 50-50 guess (that's when AUC is close to 0.5, and the curve follows the light grey line). Learn more about ROC and AUC [here](https://rviews.rstudio.com/2019/01/17/roc-curves/).
Overall, this is telling us that the model isn't especially strong as a prediction tool, but it is much better than a 50-50 guess (that's when AUC is close to 0.5, and the curve follows the light grey line). Learn more about ROC and AUC [here](https://rviews.rstudio.com/2019/01/17/roc-curves/).
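AUC has a useful interpretation: it is the probability that a randomly chosen presence scores higher than a randomly chosen background point. A minimal base-R sketch of that rank-based computation follows; it is not maxnetic's implementation, just the idea, with column names mirroring the `label`/`pred` table above.

```{r}
# AUC via the rank-sum (Mann-Whitney) identity: the chance that
# a presence (label 1) outranks a background point (label 0).
auc = function(label, pred) {
  r  = rank(pred)          # ranks; ties are averaged
  n1 = sum(label == 1)     # number of presences
  n0 = sum(label == 0)     # number of background points
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy example: presences tend to score higher than background
set.seed(42)
label = c(rep(1, 100), rep(0, 100))
pred  = c(runif(100, 0.3, 1), runif(100, 0, 0.7))
auc(label, pred)
```

An AUC of 0.5 is the 50-50 guess mentioned above; 1.0 would be perfect separation of presences from background.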


### Predict with rasters

We can also predict with raster inputs using our basic model. Let's read in rasters for each month of 2019, and then run a prediction for each month.

```{r}
dates = as.Date(c("2019-01-01", "2019-12-31"))
sst_path = "data/oisst"
sst_db = oisster::read_database(sst_path) |>
dplyr::arrange(date) |>
dplyr::filter(dplyr::between(date, dates[1], dates[2]))
sst = sst_db |>
oisster::compose_filename(path = sst_path) |>
stars::read_stars(along = list(time = sst_db$date)) |>
rlang::set_names("sst")|>
st_to_180()
wind_path = "data/nbs"
wind_db = nbs::read_database(wind_path) |>
dplyr::arrange(date)|>
dplyr::filter(dplyr::between(date, dates[1], dates[2]))
u_wind_db = wind_db |>
dplyr::filter(param == "u_wind")|>
dplyr::filter(dplyr::between(date, dates[1], dates[2]))
u_wind = u_wind_db |>
nbs::compose_filename(path = wind_path) |>
stars::read_stars(along = list(time = u_wind_db$date)) |>
rlang::set_names("u_wind") |>
st_to_180()
v_wind_db = wind_db |>
dplyr::filter(param == "v_wind")|>
dplyr::filter(dplyr::between(date, dates[1], dates[2]))
v_wind = v_wind_db |>
nbs::compose_filename(path = wind_path) |>
stars::read_stars(along = list(time = v_wind_db$date)) |>
rlang::set_names("v_wind") |>
st_to_180()
```


Once we have them in hand we need to bind them together. But first we need to attend to a common but important issue. The `sst` rasters and `windspeed` rasters have different extents. We can't bind them together until we warp one set to match the other. Let's warp `sst` to match `u_wind`. And then we can bind them together.

```{r}
sst_warped = stars::st_warp(sst, u_wind)
x = list(sst_warped, u_wind, v_wind)
predictors = do.call(c, append(x, list(along = NA_integer_)))
predictors
```

Now we can run the prediction.

```{r}
pred = predict(model, predictors, type = 'cloglog')
pred
```

Since we get a spatially mapped prediction back, we can plot it.

```{r}
coast = rnaturalearth::ne_coastline(scale = 'large', returnclass = 'sf') |>
sf::st_crop(pred)
plot_coast = function() {
plot(sf::st_geometry(coast), col = 'green', add = TRUE)
}
plot(pred, hook = plot_coast)
```

Well, that certainly looks appealing with higher likelihood of near shore observations occurring during the warmer months.

#### How did it do?

To compute an ROC and AUC for each month, we have a little bit of work to do. We need to extract the observations and background for each month from the prediction maps. These we can then pass to the `plot_ROC()` function.

:::{.callout-note}
We have to modify the date for each point to be the first date of each month. That's because our predictors are monthlies.
:::
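Flooring a date to the first of its month is simple in base R; this sketch is a stand-in for what we assume `oisster::current_month()` does.

```{r}
# Floor each Date to the first day of its month by
# reformatting with the day fixed at "01".
first_of_month = function(x) {
  as.Date(format(x, "%Y-%m-01"))
}

first_of_month(as.Date(c("2019-05-14", "2019-12-31")))
# "2019-05-01" "2019-12-01"
```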

```{r}
test_obs = obs |>
dplyr::filter(dplyr::between(date, dates[1], dates[2])) |>
dplyr::select(dplyr::all_of("date")) |>
dplyr::mutate(date = oisster::current_month(date))
test_bkg = bkg |>
dplyr::filter(dplyr::between(date, dates[1], dates[2])) |>
dplyr::select(dplyr::all_of("date")) |>
dplyr::mutate(date = oisster::current_month(date))
test_input = dplyr::bind_rows(test_obs, test_bkg)
x = stars::st_extract(pred, test_input, time_column = 'date') |>
print()
```

Finally we can build a table that merges the prediction with the labels. We are going to add the name of the month to group by that.

```{r}
y = x |>
dplyr::mutate(label = c(rep(1, nrow(test_obs)), rep(0, nrow(test_bkg))),
month = factor(format(date, "%b"), levels = month.abb),
.before = 2) |>
sf::st_drop_geometry() |>
dplyr::select(dplyr::all_of(c("month", "label", "pred"))) |>
dplyr::group_by(month)
dplyr::count(y, month, label) |>
print(n = 24)
```

Now how about one ROC plot for each month? Yikes! This requires an iterative approach, using `group_map()`, to compute the ROC for each month. We then wrap the plots together using the [patchwork](https://patchwork.data-imaginist.com/articles/guides/assembly.html#functional-assembly) package.

```{r}
#| width: "100%"
rocs = dplyr::group_map(y,
function(tbl, key){
maxnetic::plot_ROC(tbl, title = sprintf("%s, n = %i", key$month, nrow(tbl)),
xlab = "", ylab = "")
})
patchwork::wrap_plots(rocs, ncol = 4)
```

Hmmm. That's surprising, yes? Why does our AUC go down during the summer months? In fact, at times we are predicting the likelihood of **not** having an observation reported. It's hard to know what to think, but consider that we are using a model generated across all months of multiple years, and it might not predict a particular month and year very well. As a step toward refinement, our next step is to make 12 models, one for each month.
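The shape of that refinement is a split-apply loop: split the input on month, fit within each group, and keep the twelve fits in a named list. In this sketch a plain `glm()` on made-up data stands in for `maxnet::maxnet()` on the real input.

```{r}
# One model per month: split on month, then fit the same model
# within each group. glm() is a stand-in for maxnet::maxnet().
set.seed(1)
dat = data.frame(
  label = rbinom(240, 1, 0.5),
  sst   = runif(240, 5, 25),
  month = factor(rep(month.abb, each = 20), levels = month.abb)
)

models = lapply(split(dat, dat$month), function(tbl) {
  glm(label ~ sst, data = tbl, family = binomial())
})

names(models)   # "Jan" "Feb" ... "Dec"
```

Predicting a given month's raster then means looking up that month's model in the list, rather than using the single all-months model.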



0 comments on commit f52baa0
