---
title: "Remote sensing"
editor_options:
  chunk_output_type: console
author:
  name: Ivan Alberto Lizarazo
  email: [email protected]
  affiliations:
    name: Universidad Nacional de Colombia
    city: Bogotá
---
## Introduction
Remote sensing techniques, in the sense of gathering and processing data with a device separated from the object under study, are increasingly providing an important component of the set of technologies available for the study of vegetation systems and their functioning. This is despite the fact that many applications provide only indirect estimations of the biophysical variables of interest [@jones2010remote].
Particular advantages of remote sensing for vegetation studies are that: (i) it is non-contact and non-destructive; and (ii) observations are easily extrapolated to larger scales. Even at the plant scale, remotely sensed imagery is advantageous because it allows rapid sampling of large numbers of plants [@jones2010remote].
This chapter aims at providing a conceptual and practical approach to applying remote sensing data and techniques to derive information useful for monitoring crop diseases. The chapter is divided into four sections. The first introduces basic remote sensing concepts and summarizes applications of remote sensing to crop diseases. The second illustrates a case study focused on identification of banana Fusarium wilt from multispectral UAV imagery. The third illustrates a case study dealing with estimation of cercospora leaf spot disease on table beet. Finally, the chapter concludes with several reflections on the potential and limitations of this technology.
## Remote sensing background
### Optical remote sensing
Optical remote sensing makes use of the radiation reflected by a surface in the visible (\~400-700 nm), near-infrared (700-1300 nm) and shortwave-infrared (1300-\~3000 nm) parts of the electromagnetic spectrum. Spaceborne and airborne remote sensing and field spectroscopy use solar radiation as the illumination source, whereas lab spectroscopy uses a lamp as an artificial illumination source [@fig-RS1].
![Optical remote sensing via spaceborne sensors, field spectroscopy and laboratory spectroscopy (Adapted from [https://pages.cms.hu-berlin.de/EOL/geo_rs/index.htm](https://pages.cms.hu-berlin.de/EOL/geo_rs/)) .](imgs/RS1.png){#fig-RS1}
The proportion of the radiation reflected by a surface depends on the surface's spectral reflection, absorption and transmission properties and varies with wavelength [@fig-RS2]. These spectral properties in turn depend on the surface's physical and chemical constituents. Measuring the reflected radiation therefore allows us to draw conclusions about a surface's characteristics, which is the basic principle behind optical remote sensing.
![Reflection, absorption and transmission by a surface (left). Spectral reflectance profile of vegetation with the major factors determining the reflection (right). Source: <https://pages.cms.hu-berlin.de/EOL/geo_rs/>](imgs/RS-2.jpg){#fig-RS2}
### Vegetation spectral properties
Optical remote sensing enables the deduction of various vegetation-related characteristics, including biochemical properties (e.g., pigments, water content), structural properties (e.g., leaf area index (LAI), biomass) or process properties (e.g., light use efficiency (LUE)). The ability to deduce these characteristics depends on the ability of a sensor to resolve vegetation spectra. Hyperspectral sensors capture spectral information in hundreds of narrow and contiguous bands in the VIS, NIR and SWIR, and, thus, resolve subtle absorption features caused by specific vegetation constituents (e.g. anthocyanins, carotenoids, lignin, cellulose, proteins). In contrast, multispectral sensors capture spectral information in a few broad spectral bands and, thus, only resolve broader spectral features. Still, multispectral systems like Sentinel-2 have been demonstrated to be useful to derive valuable vegetation properties (e.g., LAI, chlorophyll).
![Vegetation spectrum in hyperspectral (ASD FieldSpec 4, EnMAP) and multispectral (Sentinel-2) resolution as well as characteristic spectral features caused by various constituents and processes (absorption lines shown as grey dashed lines). Source: [@hank2018]](imgs/RS-3.png){#fig-RS3}
### What does a remote sensor measure?
Optical sensors/spectrometers measure the radiation reflected by a surface into a certain solid angle, expressed as the physical quantity radiance. The unit of radiance is watts per square meter per steradian (W m^-2^ sr^-1^) [@fig-RS4]. In other words, radiance describes the amount of energy (W) that is reflected from a surface (per m^2^) and arrives at the sensor within a three-dimensional solid angle (per sr).
![Source: https://pages.cms.hu-berlin.de/EOL/geo_rs/](imgs/RS-4.png){#fig-RS4 fig-align="center" width="428"}
A general problem with using radiance as the unit of measurement is that radiance values vary with illumination. For example, the absolute incoming solar radiation varies over the course of the day as a function of the relative position between sun and surface, and so does the absolute amount of radiance measured. We can only compare measurements taken a few hours apart or on different dates when we relate the measured radiance to the incoming illumination.
The quotient between the measured reflected radiance and the measured incoming radiance (Radiance~reflected~ / Radiance~incoming~) is called reflectance (usually denoted $\rho$). Reflectance provides a stable unit of measurement that is independent of illumination: it is the fraction of the total incoming radiation that has been neither absorbed nor transmitted.
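As a toy illustration of this quotient (the radiance values below are made up and do not come from any real sensor):

```{r}
# Reflectance as the ratio of reflected to incoming radiance (made-up values, W m-2 sr-1)
radiance_reflected <- c(35.2, 48.7, 61.3)    # measured at the sensor for three bands
radiance_incoming  <- c(120.5, 118.9, 117.2) # e.g., derived from a calibrated reference panel
reflectance <- radiance_reflected / radiance_incoming
round(reflectance, 3)
```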
### Hyperspectral vs. multispectral imagery
Hyperspectral imaging involves capturing and analyzing data from a large number of narrow, contiguous bands across the electromagnetic spectrum, resulting in a high-resolution spectrum for each pixel in the image. As a result, a hyperspectral camera provides smooth spectra. The spectra provided by multispectral cameras look more like stairs or saw teeth and cannot depict fine spectral features [@fig-RS8].
### Vegetation Indices
A vegetation index (VI) is a spectral transformation of two or more bands of spectral imagery into a single-band image. A VI is designed to enhance the vegetation signal with regard to different vegetation properties, while minimizing confounding factors such as soil background reflectance, directional effects, or atmospheric effects. There are many different VIs, including multispectral broadband indices as well as hyperspectral narrowband indices.
Most of the multispectral broadband indices make use of the inverse relationship between the lower reflectance in the red (through chlorophyll absorption) and higher reflectance in the near-infrared (through leaf structure) to provide a measure of greenness that can be indirectly related to biochemical or structural vegetation properties (e.g., chlorophyll content, LAI). The Normalized Difference Vegetation Index (NDVI) is one of the most commonly used broadband VIs:
$$NDVI = \frac{\rho_{nir} - \rho_{red} }{\rho_{nir} + \rho_{red}}$$
The interpretation of the absolute value of the NDVI is highly informative, as it allows the immediate recognition of the areas of the farm or field that have problems. The NDVI is a simple index to interpret: its values vary between -1 and 1, and each value corresponds to a different agronomic situation, regardless of the crop [@fig-RS6].
![Agronomic conditions depending on the values on an NDVI scale](imgs/RS-6.png){#fig-RS6 fig-align="center"}
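As a minimal illustration of how such an index is computed, the sketch below builds NDVI from two synthetic reflectance layers with *terra* (the values are made up and do not represent any particular crop):

```{r}
library(terra)
# Synthetic red and NIR reflectance layers (made-up values)
red <- rast(nrows = 10, ncols = 10, vals = runif(100, 0.03, 0.10))
nir <- rast(nrows = 10, ncols = 10, vals = runif(100, 0.30, 0.55))
# NDVI = (NIR - Red) / (NIR + Red)
ndvi <- (nir - red) / (nir + red)
plot(ndvi, main = "NDVI (synthetic example)")
```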
## Remote sensing of crop diseases
### Detection of plant stress
One popular use of remote sensing is in the diagnosis and monitoring of plant *responses* to biotic (i.e., disease and insect damage) and abiotic stress (e.g., water stress, heat, high light, pollutants), with hundreds of publications on the topic. It is worth noting that most available techniques monitor the plant *response* rather than the stress itself. For example, with some diseases, it is common to estimate changes in canopy cover (using vegetation indices) as measures of "disease", but this measure could also be associated with water deficit [@jones2010remote]. This highlights the importance of measuring crop conditions in the field and laboratory to collect reliable data and be able to disentangle complex plant responses. Nevertheless, remote sensing can be used as the first step in site-specific disease control and also to phenotype the reactions of plant genotypes to pathogen attack [@lowe2017].
### Optical methods for measuring crop disease
There are a variety of optical sensors for the assessment of plant diseases. Sensors can be based only on the visible spectrum (400-700 nm) or also on the infrared spectrum (700 nm - 1 mm). The latter may include near-infrared (NIR) (0.75--1.4 $\mu m$), short wavelength infrared (SWIR) (1.4--3 $\mu m$), medium wavelength infrared (MWIR) (3--8 $\mu m$), or thermal infrared (8--15 $\mu m$) [@fig-RS8]. Sensors record either imaging or non-imaging (i.e., averaged) spectral radiance values, which need to be converted to reflectance before conducting any crop disease monitoring task.
![source: @delponte2024](imgs/RS-8.png){#fig-RS8}
In a recent chapter of Agrios' Plant Pathology, @delponte2024 highlights that understanding the basic principles of the interaction of light with plant tissue or the plant canopy is a crucial prerequisite for the analysis and interpretation of sensor data for disease assessment. When a plant is infected, there are changes to the physiology and biochemistry of the host, with the eventual development of disease symptoms and/or signs of the pathogen, which may be accompanied by structural and biochemical changes that affect the absorbance, transmittance, and reflectance of light [@fig-RS9].
![source: @delponte2024](imgs/RS-10.png){#fig-RS9}
### Scopes of disease sensing
The quantification of typical disease symptoms (disease severity) and the assessment of leaves infected by several pathogens are relatively simple for imaging systems but may become a challenge for non-imaging sensors and sensors with inadequate spatial resolution [@oerke2020]. Systematic monitoring of a crop by remote sensors can allow farmers to take preventive actions if infections are detected early. Remote sensors and processing techniques need to be carefully selected to be capable of (a) detecting a deviation in the crop's health status brought about by pathogens, (b) identifying the disease, and (c) quantifying the severity of the disease. Remote sensing can also be effectively used in (d) food quality control [@fig-RS10].
![Source: [@oerke2020]](imgs/RS-9.png){#fig-RS10}
### Monitoring plant diseases
Sensing of plants for precision disease control is done in large fields or greenhouses where the aim is to detect the occurrence of diseases at the early stages of epidemics, i.e., at low symptom frequency. @lowe2017 reviewed hyperspectral imaging of plant diseases, focusing on early detection of diseases for crop monitoring. They report several analysis techniques successfully used for the detection of biotic and abiotic stresses with reported levels of accuracy higher than 80%.
| Technique                                | Plant (stress)                   |
|------------------------------------------|----------------------------------|
| Quadratic discriminant analysis (QDA)    | Wheat (yellow rust)              |
|                                          | Avocado (laurel wilt)            |
| Decision tree (DT)                       | Avocado (laurel wilt)            |
|                                          | Sugarbeet (cercospora leaf spot) |
|                                          | Sugarbeet (powdery mildew)       |
|                                          | Sugarbeet (leaf rust)            |
| Multilayer perceptron (MLP)              | Wheat (yellow rust)              |
| Partial least squares regression (PLSR) on raw spectra and Savitzky-Golay 1st and 2nd derivatives | Celery (sclerotinia rot) |
| Partial least squares regression (PLSR)  | Wheat (yellow rust)              |
| Fisher's linear discriminant analysis    | Wheat (aphid)                    |
|                                          | Wheat (powdery mildew)           |
|                                          | Wheat (powdery mildew)           |
| Erosion and dilation                     | Cucumber (downy mildew)          |
| Spectral angle mapper (SAM)              | Sugarbeet (cercospora leaf spot) |
|                                          | Sugarbeet (powdery mildew)       |
|                                          | Sugarbeet (leaf rust)            |
|                                          | Wheat (head blight)              |
| Artificial neural network (ANN)          | Sugarbeet (cercospora leaf spot) |
|                                          | Sugarbeet (powdery mildew)       |
|                                          | Sugarbeet (leaf rust)            |
| Support vector machine (SVM)             | Sugarbeet (cercospora leaf spot) |
|                                          | Sugarbeet (powdery mildew)       |
|                                          | Sugarbeet (leaf rust)            |
|                                          | Barley (drought)                 |
| Spectral information divergence (SID)    | Grapefruit (canker, greasy spot, insect damage, scab, wind scar) |

: Statistical techniques used to detect both biotic and abiotic stresses in crops. Source: @lowe2017
@lowe2017 state that remote sensing of diseases under production conditions is challenging because of variable environmental factors and crop-intrinsic characteristics, e.g., 3D architecture, various growth stages, variety of diseases that may occur simultaneously, and the high sensitivity required to reliably perceive low disease levels suitable for decision-making in disease control. The use of less sensitive systems may be restricted to the assessment of crop damage and yield losses due to diseases.
### UAV applications for plant disease detection and monitoring
@kouadio2023 undertook a systematic quantitative literature review to summarize the existing literature on UAV-based applications for plant disease detection and monitoring. Results reveal a global disparity in research on the topic, with Asian countries being the top contributors, while regions such as Oceania and Africa are comparatively under-represented. To date, research has largely focused on diseases affecting wheat, sugar beet, potato, maize, and grapevine [@fig-RS11]. Multispectral, red-green-blue, and hyperspectral sensors were most often used to detect and identify disease symptoms, with current trends pointing to approaches integrating multiple sensors and the use of machine learning and deep learning techniques. The authors suggest that future research should prioritize (i) development of cost-effective and user-friendly UAVs, (ii) integration with emerging agricultural technologies, (iii) improved data acquisition and processing efficiency, (iv) diverse testing scenarios, and (v) ethical considerations through proper regulations.
![Source: @kouadio2023](imgs/RS-12.png){#fig-RS11 width="488"}
## Disease detection
This section illustrates the use of unmanned aerial vehicle (UAV) remote sensing imagery for identifying banana wilt disease. Fusarium wilt of banana, also known as "banana cancer", threatens banana production areas worldwide. Timely and accurate identification of Fusarium wilt disease is crucial for effective disease control and optimizing agricultural planting structure [@pegg2019].
A common initial symptom of this disease is the appearance of a faint pale yellow streak at the base of the petiole of the oldest leaf. This is followed by leaf chlorosis, which progresses from lower to upper leaves, wilting of leaves and longitudinal splitting of their bases. Pseudostem splitting of leaf bases is more common in young, rapidly growing plants [@pegg2019] [@fig-RSfw].
![Cavendish plant affected by Race 1 Foc (D. Peasley). Source: [@pegg2019]](https://www.frontiersin.org/files/Articles/469624/fpls-10-01395-HTML/image_m/fpls-10-01395-g006.jpg){#fig-RSfw fig-align="center"}
@ye2020 made publicly available experimental data [@HuichunYE2023] on wilted banana plants collected in a banana plantation located in Long'an County, Guangxi (China). The data set includes UAV multispectral reflectance data and ground survey data on the incidence of banana wilt disease. The paper by @ye2020 reports that the banana Fusarium wilt disease can be easily identified using several vegetation indices (VIs) obtained from this data set. Tested VIs include green chlorophyll index (CIgreen), red-edge chlorophyll index (CIRE), normalized difference vegetation index (NDVI), and normalized difference red-edge index (NDRE). The dataset can be downloaded from [here](https://www.scidb.cn/en/detail?dataSetId=8d77781a1d754db4b8842708a69c2c22).
### Software setup
Let's start by cleaning up R memory:
```{r}
rm(list=ls())
```
Then, we need to install several packages (if they are not installed yet):
```{r}
list.of.packages <- c("terra",
"tidyterra",
"stars",
"sf",
"leaflet",
"leafem",
"dplyr",
"ggplot2",
"tidymodels")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
```
Now, let's load all the required packages:
```{r}
#| warning: false
#| message: false
library(terra)
library(tidyterra)
library(stars)
library(sf)
library(leaflet)
library(leafem)
library(dplyr)
library(ggplot2)
library(tidymodels)
```
### Reading the dataset
The next code chunk assumes you have already downloaded the @HuichunYE2023 dataset and unzipped its content under the *data/banana_data* directory.
### File formats
Let's list the files under each subfolder:
```{r}
list.files("data/banana_data/1_UAV multispectral reflectance")
```
Note that the *.tif* file contains an orthophotomosaic of surface reflectance. It was created from UAV images taken with a *Micasense Red Edge M* camera which has five narrow spectral bands: Blue (465--485 nm), green (550--570 nm), red (653--673 nm), red edge (712--722 nm), and near-infrared (800--880 nm). We assume here that those images have been radiometrically and geometrically corrected.
```{r}
list.files("data/banana_data/2_Ground survey data of banana Fusarium wilt")
```
This is a shapefile with 80 points at which plant health status was recorded on the same date as the image acquisition.
```{r}
list.files("data/banana_data/3_Boundary of banana planting region")
```
This is a shapefile with one polygon representing the boundary of the study area.
### Read the orthomosaic and the ground data
Now, let's read the orthomosaic using the *terra* package:
```{r}
# Open the tif
tif <- "data/banana_data/1_UAV multispectral reflectance/UAV multispectral reflectance.tif"
rrr <- terra::rast(tif)
```
Let's check what we get:
```{r}
rrr
```
Note that this is a 5-band multispectral image with 8 cm pixel size.
Now, let's read the ground data:
```{r}
shp <- "data/banana_data/2_Ground survey data of banana Fusarium wilt/Ground_survey_data_of_banana_Fusarium_wilt.shp"
ggg <- sf::st_read(shp)
```
What did we get?
```{r}
ggg
```
Note that the attribute names and values are in Chinese. We will need to make several changes.
### Visualizing the data
As the orthomosaic is too large to visualize quickly, we will need a coarser version of it. Let's use the *terra* package to create one.
```{r}
rrr8 <- terra::aggregate(rrr, 8)  # aggregate blocks of 8 x 8 pixels into one
```
Let's check the output:
```{r}
rrr8
```
Note that the pixel size of the aggregated raster is 64 cm. Now, in order to visualize the ground points, we will need a color palette:
```{r}
pal <- colorFactor(
  palette = c('green', 'red'),
  domain = ggg$样点类型   # 样点类型 = sample point type (healthy vs. wilted plants)
)
```
Then, we will use the *leaflet* package to plot the new image and the ground points:
```{r}
#| warning: false
#| message: false
leaflet(data = ggg) |>
addProviderTiles("Esri.WorldImagery") |>
addRasterImage(rrr8) |>
addCircleMarkers(~x_经度, ~y_纬度,
radius = 5,
label = ~样点类型,
fillColor = ~pal(样点类型),
fillOpacity = 1,
stroke = F)
```
### Extracting image values at sampled points
Now we will extract the raster values at the sampled locations using the {stars} package. We expect to obtain one value per band at each location.
We need to convert the *raster* object into a *stars* object:
```{r}
sss <- st_as_stars(rrr)
```
What did we get?
```{r}
sss
```
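For reference, the simplest approach would be to read the single pixel under each point with `st_extract()`; a minimal sketch (not used further below):

```{r}
# Single-pixel extraction at the sampled points (one value per band per point)
pt_values <- st_extract(sss, ggg)
pt_values
```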
However, rather than relying on a single pixel, it is advisable to collect band values over a small window (e.g., a 3 x 3 pixel neighborhood). Thus, we will start by creating 20 cm buffers around each site:
```{r}
poly <- st_buffer(ggg, dist = 0.20)
```
Now, the extraction task:
```{r}
# Extract the median value per polygon
buf_values <- aggregate(sss, poly, FUN = median) |>
st_as_sf()
```
Here is what we got:
```{r}
buf_values
```
Note that the band names are long and unwieldy:
```{r}
names(buf_values)
```
Let's rename the bands:
```{r}
buf_values |> rename(blue = "UAV multispectral reflectance.tif.V1",
red = "UAV multispectral reflectance.tif.V2",
green = "UAV multispectral reflectance.tif.V3",
redge = "UAV multispectral reflectance.tif.V4",
nir = "UAV multispectral reflectance.tif.V5") -> buf_values2
```
Now we have shorter names for the bands:
```{r}
buf_values2
```
### Computing vegetation indices
@ye2020 used the following indices [@fig-RS12]:
![](imgs/RS_VIs.png){#fig-RS12 fig-align="center"}
Thus, we will compute several of those indices:
```{r}
buf_indices <- buf_values2 |> 
  mutate(ndvi = (nir - red) / (nir + red),
         ndre = (nir - redge) / (nir + redge),
         cire = (nir / redge) - 1,           # chlorophyll index red edge: NIR/RE - 1
         sipi = (nir - blue) / (nir - red)   # structure-insensitive pigment index
         ) |> select(ndvi, ndre, cire, sipi)
```
Here is what we got:
```{r}
buf_indices
```
Note that the health status is missing from *buf_indices*. Therefore, we will use a spatial join to attach it:
```{r}
samples <- st_join(
ggg,
buf_indices,
join = st_intersects)
```
Let's check the output:
```{r}
samples
```
It seems we succeeded. Let's check which class labels are present:
```{r}
unique(samples$样点类型)
```
Now, we will replace the Chinese labels with English words:
```{r}
samples |> 
  mutate(score = ifelse(样点类型 == '健康植株', 'healthy', 'wilted')) |>  # 健康植株 = healthy plant
  rename(east = x_经度,     # 经度 = longitude
         north = y_纬度) |>  # 纬度 = latitude
  select(OBJECTID, score, ndvi, ndre, cire, sipi) -> nsamples
```
Let's check the output:
```{r}
nsamples
```
As we do not intend to use the geometry in our model, we can remove it:
```{r}
st_geometry(nsamples) <- NULL
```
Let's check the output:
```{r}
nsamples
```
As our task is a binary classification (i.e., any site can be either healthy or wilted), the variable to estimate should be a *factor* (not a character).
Let's change the data type of that variable:
```{r}
nsamples$score = as.factor(nsamples$score)
```
Let's check the result:
```{r}
nsamples
```
A simple summary of the extracted data can be useful:
```{r}
nsamples |>
group_by(score) |>
summarize(n())
```
This means the dataset is balanced, which is very good.
### Saving the extracted dataset
Now, let's save the *nsamples* object, just in case R crashes due to lack of memory.
```{r}
# uncomment if needed (nsamples no longer has a geometry, so a plain CSV writer is enough)
# write.csv(nsamples, "data/banana_data/nsamples.csv", row.names = FALSE)
```
### Classification of Fusarium wilt using machine learning (ML)
The overall process to classify the crop disease under study will be conducted using the [*tidymodels* framework](https://www.tidymodels.org/), which is an extension of the *tidyverse* suite. It is especially focused on providing a generalized way to define, run and optimize ML models in R.
#### Exploratory analysis
As a first step in modeling, it's always a good idea to visualize the data. Let's start with a *boxplot* to display the distribution of a vegetation index. It visualizes five summary statistics (the median, two hinges and two whiskers), and all "outlying" points individually.
```{r}
p <- ggplot(nsamples, aes(score, ndre))+
r4pde::theme_r4pde()
p + geom_boxplot()
```
Next, we will do a scatterplot to visualize the indices NDRE and CIRE:
```{r}
ggplot(nsamples) +
aes(x = ndre, y = cire, color = score) +
geom_point(shape = 16, size = 4) +
labs(x = "NDRE", y = "CIRE") +
r4pde::theme_r4pde() +
scale_color_manual(values = c("#71b075", "#ba0600"))
```
#### Splitting the data
The next step is to divide the data into a training and a test set. The `set.seed()` function is used for reproducibility of computations that depend on random numbers. By default, the training/testing split is 0.75 to 0.25.
```{r}
set.seed(42)
data_split <- initial_split(data = nsamples)
data_train <- training(data_split)
data_test <- testing(data_split)
```
Let's check the result:
```{r}
data_train
```
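If the classes had been imbalanced, a stratified split could have been requested instead; a sketch using the `strata` argument of `initial_split()` (the object `data_split_strat` is not used further in this chapter):

```{r}
# A stratified 75/25 split keeps the class proportions similar in both partitions (a sketch)
set.seed(42)
data_split_strat <- initial_split(nsamples, prop = 0.75, strata = score)
data_split_strat
```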
#### Defining the model
We will use logistic regression, which is a simple model. It may be useful to have a look at [this explanation](https://mlu-explain.github.io/logistic-regression/) of the model.
```{r}
spec_lr <-
logistic_reg() |>
set_engine("glm") |>
set_mode("classification")
```
#### Defining the recipe
The `recipe()` function to be used here has two arguments:
- A formula. Any variable on the left-hand side of the tilde (\~) is considered the model outcome (here, `score`). On the right-hand side of the tilde are the predictors. Variables may be listed by name, or you can use the dot (.) to indicate all other variables as predictors.
- The data. A recipe is associated with the data set used to create the model. This will typically be the training set, so `data = data_train` here.
```{r}
recipe_lr <-
recipe(score ~ ., data_train) |>
add_role(OBJECTID, new_role = "id") |>
step_zv(all_predictors()) |>
step_corr(all_predictors())
```
#### Evaluating model performance
Next, we need to specify what we would like to see for determining the performance of the model. Different modelling algorithms have different types of metrics. Because we have a binary classification problem (healthy vs. wilted), we will choose the ROC AUC evaluation metric here, together with overall accuracy.
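As a sketch, such a metric set could be declared explicitly with the yardstick functions bundled in *tidymodels* (the object name `metrics_lr` is ours and is not used elsewhere in this chapter):

```{r}
# ROC AUC plus overall accuracy for a binary classifier (a sketch)
metrics_lr <- metric_set(roc_auc, accuracy)
metrics_lr
```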
#### Combining model and recipe into a workflow
We will want to use our recipe across several steps as we train and test our model. We will:
- Process the recipe using the training set: This involves any estimation or calculations based on the training set. For our recipe, the training set will be used to determine which predictors have zero variance or are highly correlated with other predictors and should be slated for removal.
- Apply the recipe to the training set: We create the final predictor set on the training set.
- Apply the recipe to the test set: We create the final predictor set on the test set. Nothing is recomputed and no information from the test set is used here; the zero-variance and correlation-filter results from the training set are applied to the test set.
To simplify this process, we can use a model workflow, which pairs a model and recipe together. This is a straightforward approach because different recipes are often needed for different models, so when a model and recipe are bundled, it becomes easier to train and test workflows. We'll use the workflows package from *tidymodels* to bundle our model with our recipe.
Now we are ready to set up our complete modelling workflow. This workflow contains the model specification and the recipe.
```{r}
wf_bana_wilt <-
workflow(
spec = spec_lr,
recipe_lr
)
wf_bana_wilt
```
#### Fitting the logistic regression model
Now we use the workflow created previously to fit the model on the training partition of the data.
```{r}
#| warning: false
#| message: false
fit_lr <- wf_bana_wilt |>
fit(data = data_train)
```
Let's check the output:
```{r}
fit_lr
```
Now, we will use the fitted model to estimate health status in the training data:
```{r}
lr_training_pred <- 
predict(fit_lr, data_train) |>
bind_cols(predict(fit_lr, data_train, type = "prob")) |>
# Add the true outcome data back in
bind_cols(data_train |>
select(score))
```
What did we get?
```{r}
lr_training_pred
```
Let's estimate the training accuracy:
```{r}
lr_training_pred |> # training set predictions
accuracy(truth = score, .pred_class) -> acc_train
acc_train
```
The accuracy of the model on the training data is about 0.85, which is well above 0.5 (mere chance). This basically means that the model was able to learn predictive patterns from the training data. To see whether the model can generalize what it learned when exposed to new data, we evaluate it on our hold-out (or so-called test) data. We created a test dataset when splitting the data at the start of the modelling.
#### Evaluating the model on test data
Now, we will use the fitted model to estimate health status in the testing data:
```{r}
lr_testing_pred <-
predict(fit_lr, data_test) |>
bind_cols(predict(fit_lr, data_test, type = "prob")) |>
bind_cols(data_test |> select(score))
```
Here is what we got:
```{r}
lr_testing_pred
```
Let's compute the testing accuracy:
```{r}
lr_testing_pred |> # test set predictions
accuracy(score, .pred_class)
```
The resulting accuracy is similar to the accuracy on the training data, which is good for a first attempt with a relatively simple classification model.
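Beyond overall accuracy, the confusion matrix and the ROC AUC give a fuller picture of test-set performance; a short sketch using the predictions computed above:

```{r}
# Confusion matrix on the test set
lr_testing_pred |>
  conf_mat(truth = score, estimate = .pred_class)

# ROC AUC; event_level = "second" because "wilted" is the second factor level
lr_testing_pred |>
  roc_auc(truth = score, .pred_wilted, event_level = "second")
```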
Let's plot the ROC curve:
```{r}
lr_testing_pred |>
roc_curve(truth = score, .pred_wilted, event_level="second") |>
mutate(model = "Logistic Regression") |>
autoplot()
```
### Conclusions
In this section, we trained and tested a logistic regression model using four spectral indices as predictor variables (i.e., NDVI, NDRE, CIRE and SIPI). Compare this section's results, in terms of model and accuracy, with the individual logistic regression models tested by @ye2020 [@fig-RS13].
![](imgs/RS_logregmodels.png){#fig-RS13 fig-align="center"}
Note that we have not tested other ML algorithms, but many are available through the *tidymodels* framework (e.g., random forests, support vector machines, gradient boosting machines), as sketched below.
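For instance, a random forest could be swapped in by changing only the model specification; a sketch (it assumes the *ranger* package is installed and reuses the recipe defined above):

```{r}
# A random forest specification that could replace spec_lr in the same workflow (a sketch)
spec_rf <-
  rand_forest(trees = 500) |>
  set_engine("ranger") |>
  set_mode("classification")

# wf_rf  <- workflow(spec = spec_rf, recipe_lr)
# fit_rf <- fit(wf_rf, data = data_train)
```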
To conclude, this section illustrated how to use VIs derived from UAV-based multispectral imagery, together with ground data, to develop a model for detecting banana Fusarium wilt. The results showed that a simple logistic regression model is able to identify Fusarium wilt of banana from several VIs with good accuracy. However, before getting too optimistic, I suggest studying the @ye2020 paper and critically evaluating their experimental design, methods and results.
## Disease quantification
### Introduction
This section illustrates the use of UAV multispectral imagery for estimating the severity of cercospora leaf spot (CLS), the most destructive fungal disease of table beet [@skaracis2010; @tan2023]. CLS causes rapid defoliation, and significant crop loss may occur through the inability to harvest with top-pulling machinery. It may also render the produce unsaleable for the fresh market [@skaracis2010].
@Saif2024 recently published on [Mendeley Data](https://data.mendeley.com/datasets/v9b7rwrwx9/1) UAV imagery and ground truth CLS data collected at several table beet plots at Cornell AgriTech, Geneva, New York, USA. Note that, in this study, UAV multispectral images were collected using a Micasense Red Edge camera, similar to the one used in the case study on detection of Fusarium wilt of banana. I have not found any scientific paper estimating CLS severity from this dataset. However, in @saif2023, a similar dataset was used to forecast table beet root yield at Cornell AgriTech. Although it is only a guess that the plots overlap, it is worth visualizing the plots in the latter study.
![Color mosaic of third flight hyperspectral image. The visualization bands are the following: red (639 nm), green (550 nm), and blue (470 nm). The crops in the red boxes in the figure were the plots under study. Source: [@saif2023]](imgs/RS_beetyield.png){#fig-RSbeetyield fig-align="center"}
This section aims at estimating CLS leaf severity using vegetation indices (VIs) derived from the multispectral UAV imagery as spectral covariates. The section comprises two parts: (i) Part 1 creates a spectral covariate table to summarize the complete UAV imagery & ground truth data; and (ii) Part 2 trains and tests a machine learning model to estimate CLS severity.
### CLS severity on leaves
It is very important to visualize how different levels of table beet CLS severity look in RGB color leaf images. The figure below is a standard area diagram set (SAD) used by raters during visual severity assessment to increase the accuracy and reliability of their estimates.
![Standard area diagram set for the severity of table beet cercospora leaf spot. Source: [@delponte2019]](imgs/RS_cls_sev.png){#fig-RS-cls_sev fig-align="center"}
### Software setup
Let's start by cleaning up R memory:
```{r}
rm(list=ls())
```
Then, we need to install several packages (if they are not installed yet):
```{r}
#| warning: true
list.of.packages <- c("readr","terra", "tidyterra", "stars", "sf", "leaflet", "leafem", "dplyr", "ggplot2", "tidymodels")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
```
Now, let's load all the required packages:
```{r}
#| warning: false
#| message: false
library(readr)
library(terra)
library(tidyterra)
library(stars)
library(sf)
library(leaflet)
library(leafem)
library(dplyr)
library(ggplot2)
library(tidymodels)
```
### Read the dataset
The following code assumes you have already downloaded the [dataset](https://data.mendeley.com/datasets/v9b7rwrwx9/1) and unzipped its content under the *data/cercospora_data* directory. What files are in that folder?
```{r}
list.files("data/cercospora_data")
```
Note that there is one CLS_DS.csv file with the CLS ground data and two folders with the UAV multispectral images. We can infer that 2021, 2022, 2023 refer to the image acquisition years. At this point, it is very important to check the README file. I will summarize the following points:
#### File naming convention
Each image file is named according to the plot number and the date of capture, using the format plt_rYYYYMMDD, where:
- **'plt'** stands for the plot number.
- **'YYYYMMDD'** represents the date on which the image was captured (year, month, day).
For example, the file name 5_r20210715 corresponds to an image taken on July 15, 2021, from study plot 5.
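A small sketch of how the plot number and the acquisition date could be parsed from such a file name in R (the object names are ours):

```{r}
# Parse "5_r20210715" into a plot number and a date (a sketch)
fname    <- "5_r20210715"
plot_id  <- as.integer(sub("^(\\d+)_r\\d{8}$", "\\1", fname))
img_date <- as.Date(sub("^\\d+_r(\\d{8})$", "\\1", fname), format = "%Y%m%d")
plot_id
img_date
```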
#### CLS Severity Data
The file named CLS_DS.csv contains visual assessments of CLS disease severity recorded for each plot.
### Inspect the format of each file
Let's list the first images under an image folder:
```{r}
list.files("data/cercospora_data/multispec_2021_2022/")[1:15]
```
Note that, for 2021, there are six dates of image acquisition: 20210707, 20210715, 20210720, 20210726, 20210802, and 20210825.
### Read the orthomosaics
Now, let's read several plot images using the *terra* package:
```{r}
# Paths to the images of study plots 1-10 collected on 7 July 2021
tif1 <- "data/cercospora_data/multispec_2021_2022/1_r20210707.tif"
tif2 <- "data/cercospora_data/multispec_2021_2022/2_r20210707.tif"
tif3 <- "data/cercospora_data/multispec_2021_2022/3_r20210707.tif"
tif4 <- "data/cercospora_data/multispec_2021_2022/4_r20210707.tif"
tif5 <- "data/cercospora_data/multispec_2021_2022/5_r20210707.tif"
tif6 <- "data/cercospora_data/multispec_2021_2022/6_r20210707.tif"
tif7 <- "data/cercospora_data/multispec_2021_2022/7_r20210707.tif"
tif8 <- "data/cercospora_data/multispec_2021_2022/8_r20210707.tif"
tif9 <- "data/cercospora_data/multispec_2021_2022/9_r20210707.tif"
tif10 <- "data/cercospora_data/multispec_2021_2022/10_r20210707.tif"
```
It may be convenient to increase image pixel size (to make the raster lighter):
```{r}
p01 <- terra::rast(tif1) %>% aggregate(20)
p02 <- terra::rast(tif2) %>% aggregate(20)
p03 <- terra::rast(tif3) %>% aggregate(20)
p04 <- terra::rast(tif4) %>% aggregate(20)
p05 <- terra::rast(tif5) %>% aggregate(20)
p06 <- terra::rast(tif6) %>% aggregate(20)
p07 <- terra::rast(tif7) %>% aggregate(20)
p08 <- terra::rast(tif8) %>% aggregate(20)
p09 <- terra::rast(tif9) %>% aggregate(20)
p10 <- terra::rast(tif10) %>% aggregate(20)
```
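The ten calls above could also be generated programmatically; a compact sketch using the same files and the same aggregation factor:

```{r}
# Build the file paths and aggregate each plot image in one pass (a sketch)
tifs  <- sprintf("data/cercospora_data/multispec_2021_2022/%d_r20210707.tif", 1:10)
plots <- lapply(tifs, function(f) terra::aggregate(terra::rast(f), 20))
```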
Let's check what we get:
```{r}
p01
```
Note that each image has 5 bands with \~21 cm pixel size.
As the images have been split into plots, it may be useful to "mosaic" them.
```{r}
# with many SpatRasters, make a SpatRasterCollection from a list
rlist <- list(p01, p02, p03, p04, p05, p06, p07, p08, p09, p10)
rsrc <- sprc(rlist)
m <- merge(rsrc)
```
What did we get?
```{r}
m
```
Let's look at the band names:
```{r}
names(m)
```
We will rename the bands:
```{r}
# Rename
m2 <- m %>%
rename(blue = "1_r20210707_1", green = "1_r20210707_2",
red = "1_r20210707_3", redge= "1_r20210707_4",
nir = "1_r20210707_5")
```
Let's get a summary of the mosaic:
```{r}
summary(m2)
```
Let's also look at histograms of the bands:
```{r}
hist(m2)
```
Let's visualize the mosaic:
```{r}
ggplot() +
geom_spatraster(data = m2) +
facet_wrap(~lyr, ncol = 5) +
r4pde::theme_r4pde(font_size = 10)+
scale_fill_whitebox_c(
palette = "muted",
labels = scales::label_number(suffix = "%"),
n.breaks = 10,
guide = guide_legend(reverse = TRUE)
) +
labs(
fill = "",
title = "UAV multispectral mosaic",
subtitle = "07.07.2021"
)
```
Let's read all images collected at a later date:
```{r}
# Paths to the images of study plots 1-10 collected on 25 August 2021
tif1 <- "data/cercospora_data/multispec_2021_2022/1_r20210825.tif"
tif2 <- "data/cercospora_data/multispec_2021_2022/2_r20210825.tif"
tif3 <- "data/cercospora_data/multispec_2021_2022/3_r20210825.tif"
tif4 <- "data/cercospora_data/multispec_2021_2022/4_r20210825.tif"
tif5 <- "data/cercospora_data/multispec_2021_2022/5_r20210825.tif"
tif6 <- "data/cercospora_data/multispec_2021_2022/6_r20210825.tif"
tif7 <- "data/cercospora_data/multispec_2021_2022/7_r20210825.tif"
tif8 <- "data/cercospora_data/multispec_2021_2022/8_r20210825.tif"
tif9 <- "data/cercospora_data/multispec_2021_2022/9_r20210825.tif"
tif10 <- "data/cercospora_data/multispec_2021_2022/10_r20210825.tif"
```
```{r}
p01 <- terra::rast(tif1) %>% aggregate(20)
p02 <- terra::rast(tif2) %>% aggregate(20)
p03 <- terra::rast(tif3) %>% aggregate(20)
p04 <- terra::rast(tif4) %>% aggregate(20)
p05 <- terra::rast(tif5) %>% aggregate(20)
p06 <- terra::rast(tif6) %>% aggregate(20)
p07 <- terra::rast(tif7) %>% aggregate(20)
p08 <- terra::rast(tif8) %>% aggregate(20)
p09 <- terra::rast(tif9) %>% aggregate(20)
p10 <- terra::rast(tif10) %>% aggregate(20)
```
```{r}
# with many SpatRasters, make a SpatRasterCollection from a list
rlist <- list(p01, p02, p03, p04, p05, p06, p07, p08, p09, p10)
rsrc <- sprc(rlist)
mm <- merge(rsrc)
```
```{r}
names(mm)
```
```{r}
# Rename
mm2 <- mm %>%
rename(blue = "1_r20210825_1", green = "1_r20210825_2",
red = "1_r20210825_3", redge= "1_r20210825_4",
nir = "1_r20210825_5")
```
Now, a visualization task:
```{r}
ggplot() +
geom_spatraster(data = mm2) +
facet_wrap(~lyr, ncol = 5) +
r4pde::theme_r4pde(font_size = 10)+
scale_fill_whitebox_c(
palette = "muted",
labels = scales::label_number(suffix = "%"),
n.breaks = 10,
guide = guide_legend(reverse = TRUE)
) +
labs(
fill = "",
title = "UAV multispectral mosaic",
subtitle = "08.25.2021"
)
```
### Reading severity data
Now, let's read the ground CLS data:
```{r}
file <- "data/cercospora_data/CLS_DS.csv"
cls <- readr::read_csv(file, col_names = TRUE, show_col_types = FALSE)
```
What did we get?
```{r}
cls
```
Note that columns D0 to D6 contain the disease severity, recorded as a percentage on different dates for each study plot. Note also that this object is structured in *wide format* (i.e., each row represents a single plot in a given year). Let's check the plot IDs:
```{r}
unique(cls$Plot)
```
Let's check the values stored in the *year* attribute:
```{r}
unique(cls$year)
```
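Since the severity table is in wide format, it may later be convenient to reshape it so that each row is one plot-by-assessment record; a sketch using `tidyr::pivot_longer()` (the column range D0:D6 is taken from the printout above, and the object `cls_long` is not used further in this chapter):

```{r}
# Reshape the severity table from wide to long format (a sketch)
cls_long <- cls |>
  tidyr::pivot_longer(cols = D0:D6, names_to = "assessment", values_to = "severity")
cls_long
```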
### Preparing a set of spectral covariates per plot and date
The 2021 data collection dates are 20210707, 20210715, 20210720, 20210802, and 20210825.
Let's start by creating a list of images per date for 2021:
```{r}
# Create lists of the .tif files collected on each date
# (these are the DXfiles objects)
D0files <- Sys.glob("data/cercospora_data/multispec_2021_2022/*_r20210707.tif")
D1files <- Sys.glob("data/cercospora_data/multispec_2021_2022/*_r20210715.tif")
D2files <- Sys.glob("data/cercospora_data/multispec_2021_2022/*_r20210720.tif")
D3files <- Sys.glob("data/cercospora_data/multispec_2021_2022/*_r20210802.tif")
D4files <- Sys.glob("data/cercospora_data/multispec_2021_2022/*_r20210825.tif")
```
Now, define date variables:
```{r}
# These are the DX variables
(D0 <- as.Date('7/7/2021',format='%m/%d/%Y'))
(D1 <- as.Date('7/15/2021',format='%m/%d/%Y'))
(D2 <- as.Date('7/20/2021',format='%m/%d/%Y'))
(D3 <- as.Date('8/02/2021',format='%m/%d/%Y'))
(D4 <- as.Date('8/25/2021',format='%m/%d/%Y'))
```
The following block of code should be executed five times (once for each *DXfiles* list and its matching *DX* variable); a sketch that wraps all five runs into a single helper function is given at the end of this section.
The code reads each image in the given *DXfiles* list, computes several vegetation indices (VIs) for that image, and obtains the mean value of the original bands and of the VIs for each plot. The per-plot results are stored in a *list* object.
```{r}
# Read all files for one date
lista <- lapply(D0files, function(x) rast(x))
# Compute plot-level statistics
output <- vector("list", length(lista))          # 1. output (one element per plot image)
for (i in seq_along(lista)) {                    # 2. sequence
  img  <- lista[[i]]
  r    <- clamp(img, 0, 1)
  ndvi <- (r[[5]] - r[[3]]) / (r[[5]] + r[[3]])
  ndre <- (r[[5]] - r[[4]]) / (r[[5]] + r[[4]])
  cire <- (r[[5]] / r[[4]]) - 1                  # chlorophyll index red edge: NIR/RE - 1
  sipi <- (r[[5]] - r[[1]]) / (r[[5]] - r[[3]])  # SIPI: (NIR - Blue) / (NIR - Red)
  min  <- minmax(ndvi)[1]
  max  <- minmax(ndvi)[2]
  avg  <- global(ndvi, fun = "mean", na.rm = TRUE)[, 1]
  mndvi <- 2 + log((ndvi - avg) / (max - min))   # a rescaled NDVI used as an extra covariate
  nimg <- c(r, ndvi, ndre, cire, sipi, mndvi)
  # store the per-layer means (one value per band and per VI) for this plot
  output[[i]] <- global(nimg, fun = "mean", na.rm = TRUE)[, 1]
}
```
The following block of code combines all the vectors produced previously into a matrix, as a first step towards putting the data into a data frame with meaningful column names. It also adds columns storing each plot ID as well as the corresponding date. Note that this code also needs to be executed five times (once for each *DX* variable).
```{r}
# combine all plot-level vectors into a matrix
mat <- do.call("rbind", output)
# convert the matrix to a data frame
df <- as.data.frame(mat)
# specify column names
colnames(df) <- c('blue', 'green', 'red', 'redge', 'nir',
                  'ndvi', 'ndre', 'cire', 'sipi', 'mndvi')
# add the plot ID and the acquisition date
plot  <- 1:40
date  <- rep(D0, 40)
newdf <- cbind(df, plot, date)
```
What did we get?
```{r}
newdf
```
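To avoid running the two blocks above manually five times, they could be bundled into a helper function and mapped over the file lists and dates; a sketch (the name `extract_covariates()` is ours, the band order and VIs are the same as above, and the plot ID is parsed from each file name instead of being assumed to be 1 to 40):

```{r}
# A helper returning one row of plot-level covariates per image for a given date (a sketch)
extract_covariates <- function(files, date) {
  rows <- lapply(files, function(f) {
    r     <- clamp(rast(f), 0, 1)
    ndvi  <- (r[[5]] - r[[3]]) / (r[[5]] + r[[3]])
    ndre  <- (r[[5]] - r[[4]]) / (r[[5]] + r[[4]])
    cire  <- (r[[5]] / r[[4]]) - 1
    sipi  <- (r[[5]] - r[[1]]) / (r[[5]] - r[[3]])
    avg   <- global(ndvi, fun = "mean", na.rm = TRUE)[, 1]
    rng   <- minmax(ndvi)[2] - minmax(ndvi)[1]
    mndvi <- 2 + log((ndvi - avg) / rng)
    global(c(r, ndvi, ndre, cire, sipi, mndvi), fun = "mean", na.rm = TRUE)[, 1]
  })
  df <- as.data.frame(do.call(rbind, rows))
  colnames(df) <- c("blue", "green", "red", "redge", "nir",
                    "ndvi", "ndre", "cire", "sipi", "mndvi")
  df$plot <- as.integer(sub("^(\\d+)_r\\d{8}\\.tif$", "\\1", basename(files)))
  df$date <- date
  df
}

# covariates_2021 <- do.call(rbind, Map(extract_covariates,
#   list(D0files, D1files, D2files, D3files, D4files),
#   list(D0, D1, D2, D3, D4)))
```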