---
title: "Standard area diagrams"
editor_options:
chunk_output_type: inline
---
## Definitions
According to a glossary on phytopathometry [@bock2021], a standard area diagram (SAD) can be defined as "*a generic term for a pictorial or graphic representation (drawing or true-color photo) of selected disease severities on plants or plant parts (leaves, fruit, flowers, etc.) generally used as an aid for more accurate visual estimation (on the percentage scale) or classification (using an ordinal scale) of severity on a specimen*".
Standard area diagrams (SADs), also known as diagrammatic scales, have a long history of use in plant pathology. The concept dates back to the late 1800s, when the Cobb scale was developed, featuring five diagrams depicting a range of severity levels of rust pustules on wheat leaves.
In the past 20 years, plant pathologists have leveraged advancements in image processing and analysis tools, along with insights from psychophysical and measurement sciences, to develop SADs that are realistic (e.g., true-color photographs), validated, and depict severities that maximize estimation accuracy. SADs have been created in various color formats (black and white, two-color, or true-color) and with varying incremental scales (approximately linear or logarithmic) [@delponte2017].
SADs have proven beneficial in increasing the accuracy of visual estimates, as estimating percentage areas is generally more challenging than classifying severity into ordinal classes: there are numerous possibilities on the percentage scale, compared with the finite and small number of classes in ordinal scales. A recent quantitative review confirmed that using SADs often results in improved accuracy and precision of visual estimates. However, it also identified factors related to SAD design and structure, disease symptoms, and actual severity that affected the outcomes. In particular, SADs have shown greater utility for raters who are inherently less accurate and for diseases characterized by small and numerous lesions [@delponte2022]. Here are examples of SADs in black and white, two-color, and true-color formats:
------------------------------------------------------------------------
![Actual photos of symptoms of loquat scab on fruit (left) and a SADs with eight diagrams (right). Each number represents severity as the percent area affected [@gonzález-domínguez2014]](imgs/sad-loquat.png){#fig-sad-loquat fig-align="center" width="625"}
------------------------------------------------------------------------
![SADs for Glomerella leaf spot on an apple leaf. Each number represents severity as the percent area affected [@moreira2018]](imgs/sad-apple.png){#fig-sad-apple fig-align="center" width="622"}
------------------------------------------------------------------------
![SADs for soybean rust. Each number represents severity as the percent area affected [@franceschi2020]](imgs/sad-sbr.png){#fig-sad-sbr fig-align="center" width="625"}
More SADs can be found in the [SADBank](https://emdelponte.github.io/sadbank/), a curated collection of articles on SAD development and validation. Click on the image below to get access to the database.
[![SADBank, a curated collection of articles](imgs/sadbank.png){#fig-sadbank fig-align="center" width="87%"}](https://emdelponte.github.io/sadbank/)
## SAD development and validation
A systematic review of the literature on SADs highlighted the most important aspects related to the development and validation of the tool [@delponte2017]. A list of best practices was proposed in the review to guide future research in the area. The most important points to note are listed below:
::: callout-tip
## Best practices on SADs development
- Sample a minimum number (e.g., n = 100) of specimens from natural epidemics representing the range of disease severity and typical symptoms observed.
- Use reliable image analysis software to discriminate disease symptoms from healthy areas to calculate percent area affected.
- When designing the illustrations for the SAD set, ensure that the individual diagrams are prepared realistically, whether line drawn, actual photos, or computer generated.
- The number of diagrams should be no less than 6 and no more than 10, distributed approximately linearly, and spaced no more than 15% apart. Additional diagrams (±2) should be included between 0 and 10% severity (a sketch of such a spacing is shown after this callout).
- For the validation trial, select at least 50 specimens representing the full range of actual severity and symptom patterns.
- When selecting raters (a minimum of 15) for validation, make sure they do not have previous experience in using the SAD under evaluation.
- Provide standard instructions on how to recognize the symptoms of the disease and how to assess severity, first without and then with the SAD.
- Ideally, repeat the assessment over time, with a 1- or 2-week interval, both without and with the aid, using the same set of raters, in order to evaluate the effect of training and experience on gains in accuracy.
- Both pre- and posttest experiment conditions should be the same to avoid any impact of distraction on accuracy of estimates during the tests.
:::
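As an illustration of the spacing guideline above, here is a minimal sketch (using a hypothetical maximum severity of 75%) of how a set of target severities could be laid out: approximately linear spacing up to the maximum, steps no larger than 15%, plus a couple of extra values below 10%.

```{r}
# Hypothetical example of target severities following the spacing guideline
max_sev <- 75 # hypothetical maximum severity observed in the sampled specimens
targets <- sort(c(1, 5, seq(10, max_sev, length.out = 6)))
targets
diff(targets) # spacing between consecutive diagrams (should not exceed 15)
```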
## Designing SADs in R
The diagrams used in a set have been developed using various methods and technologies, ranging from hand-drawn diagrams to actual photographs [@delponte2017]. There is an increasing trend towards using actual photos that are digitally analyzed using standard image analysis software to determine the percent area affected. With this approach, a large set of images is analyzed, and some images are chosen to represent the severities in the SAD according to the scale structure.
In R, the pliman package has a function called `sad()` which allows the automatic generation of a SADs with a pre-defined number of diagrams. First, as shown in the [previous chapter](data-actual-severity.html), the set of images needs to be analyzed using the `measure_disease()` function. Then, a SADs is automatically generated. The function selects the specimens with the lowest and highest severities for the SADs, and the intermediate diagrams are sampled sequentially, after ordering the severities from low to high, to achieve the pre-defined number of images. More details on the function can be found [here](https://tiagoolivoto.github.io/pliman/reference/sad.html).
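The selection logic can be illustrated with a simple sketch (this is not pliman's internal code, just the idea): order the measured severities and pick, at approximately regular intervals, the desired number of specimens, always keeping the lowest and the highest.

```{r}
# Illustration only: hypothetical severities (%) measured on a set of specimens
severities <- c(0.8, 2.1, 4.7, 8.3, 12.5, 18.9, 25.4, 41.2, 55.6, 74.8)
n_diagrams <- 5
ordered_sev <- sort(severities)
# indices approximately evenly spaced, from the lowest to the highest severity
idx <- round(seq(1, length(ordered_sev), length.out = n_diagrams))
ordered_sev[idx]
```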
Let's use the [same set](data-actual-severity.html#multiple-images) of 10 soybean leaves with rust symptoms, as seen in the previous chapter, and create the `sbr` object.
```{r}
#| message: false
#| warning: false
library(pliman)
h <- image_import("imgs/sbr_h.png")
s <- image_import("imgs/sbr_s.png")
b <- image_import("imgs/sbr_b.png")
sbr <- measure_disease(
  pattern = "img",
  dir_original = "imgs/originals",
  dir_processed = "imgs/processed",
  save_image = TRUE,
  img_healthy = h,
  img_symptoms = s,
  img_background = b,
  show_original = FALSE, # set to TRUE to show the original image
  col_background = "white",
  verbose = FALSE,
  plot = FALSE
)
```
We are ready to run the `sad()` function to create a SADs with five diagrams side by side. The resulting SADs is two-color by default. Set the argument `show_original` to `TRUE` to show the original image in the SADs.
```{r}
sad(sbr, 5, ncol = 5)
```
## Analysis of SADs validation data
To evaluate the effect of SADs on accuracy components, analyze the data, preferably using concordance analysis methods ([see chapter](data-accuracy.html)), to fully explore which components are affected and to gain insight into the ramifications of errors. Linear regression should not be used as the sole method, but it can be complementary, allowing comparison with previous literature.
Inferential methods should be used for testing hypotheses related to gains in accuracy. If parametric tests are used (a paired t-test, for example), make sure to check that the assumptions are not violated. Alternatively, non-parametric tests (e.g., the Wilcoxon signed-rank test) or non-parametric bootstrapping should be used when the conditions for parametric tests are not met. More recently, a (parametric) mixed modeling framework has been used to analyze SADs validation data, where raters are taken as random effects in the model [@gonzález-domínguez2014; @franceschi2020; @pereira2020].
### Non-parametric bootstrapping of differences
The bootstrap is a resampling method in which a large number of samples of the same size are repeatedly drawn, ***with replacement***, from a single original sample. It is commonly used when the distribution of a statistic is unknown or complicated, or when the sample size is too small to draw a valid inference.
A bootstrap-based equivalence test procedure was first proposed as complementary to parametric (paired t-test) or non-parametric (Wilcoxon) tests to analyze severity estimation data in a study on the development and validation of a SADs for pecan scab [@yadav2012]. The equivalence test was used to calculate 95% confidence intervals for each statistic by bootstrapping using the percentile method (with an equivalence test, the null hypothesis is the converse of the usual one, i.e. the null hypothesis is non-equivalence). In that study, the test was used to compare means of the CCC statistic across raters under two conditions: 1) without versus with the SAD; and 2) experienced versus inexperienced raters.
To apply the bootstrap-based equivalence test, let's work with the CCC data for a sample of 20 raters who estimated the severity of soybean rust first without and then with the aid of the SAD. The CCC was calculated as shown [here](data-accuracy.html#concordance-correlation-coefficient).
```{r}
#| warning: false
library(tidyverse)
library(r4pde)
sbr <- tibble::tribble(
~rater, ~aided, ~unaided,
1, 0.97, 0.85,
2, 0.97, 0.85,
3, 0.95, 0.82,
4, 0.93, 0.69,
5, 0.97, 0.84,
6, 0.96, 0.86,
7, 0.98, 0.78,
8, 0.93, 0.72,
9, 0.94, 0.67,
10, 0.95, 0.53,
11, 0.94, 0.78,
12, 0.98, 0.89,
13, 0.96, 0.8,
14, 0.98, 0.87,
15, 0.98, 0.9,
16, 0.98, 0.87,
17, 0.98, 0.84,
18, 0.97, 0.86,
19, 0.98, 0.89,
20, 0.98, 0.78
)
```
Let's visualize the data using boxplots. Each point in the plot represents a rater.
```{r}
#| warning: false
theme_set(theme_r4pde())
sbr |>
pivot_longer(2:3, names_to = "condition", values_to ="estimate") |>
ggplot(aes(condition, estimate))+
geom_boxplot(outlier.colour = NA)+
geom_jitter(width = 0.05, size = 2, alpha = 0.5)+
theme_r4pde()+
ylim(0.4,1)
```
To proceed with bootstrapping, we first create a new variable holding the difference between the aided and unaided estimates for each rater (aided minus unaided). If the 95% CI of the mean difference does not include zero, we can infer a significant improvement in the statistic.
```{r}
#| warning: false
# diff of means
sbr$diff <- sbr$aided - sbr$unaided
sbr |>
ggplot(aes(x= diff))+
theme_r4pde()+
geom_histogram(bins = 10, color = "white")
```
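Before turning to dedicated packages, the percentile bootstrap can be sketched in a few lines of base R: resample the differences with replacement many times, compute the mean of each resample, and take the 2.5% and 97.5% quantiles of the resampled means (the seed is set only to make the resampling reproducible).

```{r}
set.seed(123) # for reproducibility of the resampling
boot_means <- replicate(999, mean(sample(sbr$diff, replace = TRUE)))
quantile(boot_means, c(0.025, 0.975))
```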
The same interval can be obtained using the *simpleboot* and *boot* packages:
```{r}
library(simpleboot)
b.mean <- one.boot(sbr$diff, mean, 999)
boot::boot.ci(b.mean)
mean(b.mean$data)
hist(b.mean)
```
Using the bootstrap package:
```{r}
library(bootstrap)
b <- bootstrap(sbr$diff, 999, mean)
quantile(b$thetastar, c(.025,.975))
mean(b$thetastar)
sd(b$thetastar)
se <- function(x) sqrt(var(x)/length(x))
se(b$thetastar)
```
The procedures shown above led to similar results. The 95% CIs of the mean difference did not include zero, so a significant improvement in accuracy can be inferred.
### Parametric and non-parametric paired sample tests
When two estimates are gathered from the same rater at different times, these data points are not independent. In such situations, a **paired sample t-test** can be utilized to test if the mean difference between two sets of observations is zero. This test requires each subject (or leaf, in our context) to be measured or estimated twice, resulting in *pairs* of observations. However, if the assumptions of the test (such as normality) are violated, a non-parametric equivalent, such as the Wilcoxon signed-rank test, also known as the **Wilcoxon test**, can be employed. This alternative is particularly useful when the data are not normally distributed.
To proceed with these tests, we first need to check whether our data are normally distributed and whether the variances are equal. Let's run these checks and then apply the two tests to our data and compare the results.
```{r}
# normality test
shapiro.test(sbr$aided)
shapiro.test(sbr$unaided)
# equal variance test
var.test(sbr$aided, sbr$unaided)
# paired t-test
t.test(sbr$aided, sbr$unaided, paired = TRUE)
# Wilcoxon test
wilcox.test(sbr$aided, sbr$unaided, paired = TRUE)
```
As shown above, the two assumptions were violated, so we could rely more confidently on the non-parametric test.
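Note that, for the paired t-test, the critical assumption concerns the paired differences rather than each condition separately. A quick complementary check, using the `diff` column created earlier:

```{r}
# normality of the paired differences (the quantity the paired t-test actually uses)
shapiro.test(sbr$diff)
```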
### Mixed effects modeling
Mixed models, also known as mixed effects models or multilevel models, are an extension of traditional linear models that are used for analyzing hierarchical or clustered data. These models are particularly useful when dealing with data where observations may not be fully independent, or when the assumption of independence is violated. This happens, for instance, when data are collected over time from the same individuals or units, or when individuals are grouped or nested within higher-level units, such as in our case where measurements are taken by different raters [@brown2021].
Mixed models enable us to model both fixed and random effects. Fixed effects represent the usual regression parameters that we are primarily interested in estimating, while random effects model the random variation that occurs at different levels of hierarchy or clustering. They allow us to account for variability among different levels of data, like inter-rater variability or intra-subject variability in repeated measures designs.
In our context, we consider raters as random effects because we view them as a sample drawn from a larger population of potential raters, and our goal is to generalize our findings to this larger population. If we were to sample additional raters, we would expect these new raters to differ from our current ones. However, by considering raters as a random effect in our model, we can account for this inter-rater variability and make more accurate inferences about the overall population.
The random effects component in the mixed model allows us to capture and model the additional variance that is not explicitly accounted for by the fixed effects in our model. In other words, random effects help us to capture and quantify the 'unexplained' or 'residual' variation that exists within and between the clusters or groups in our data. This could include, for instance, variation in disease measurements that are taken repeatedly from the same subjects. In conclusion, mixed models provide a robust and flexible framework for modeling hierarchical or clustered data, allowing us to effectively account for both fixed and random effects and to make more accurate inferences about our data.
Let's start reshaping our data to the long format and assign them to a new data frame.
```{r}
sbr2 <- sbr |>
pivot_longer(2:3, names_to = "condition", values_to = "estimate")
```
Now we fit the mixed model using the `lmer` function of the *lme4* package. We will fit the model to the logit of the estimates because they are bounded between zero and one. Preliminary analysis using non-transformed or log-transformed data resulted in lack of normality of residuals and heteroscedasticity (not shown).
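In formula terms, the random-intercept model fitted below can be sketched as

$$
\mathrm{logit}(y_{ij}) = \mu + \tau_j + r_i + \varepsilon_{ij}, \qquad r_i \sim N(0, \sigma_r^2), \quad \varepsilon_{ij} \sim N(0, \sigma^2),
$$

where $y_{ij}$ is the CCC of rater $i$ under condition $j$ (unaided or aided), $\tau_j$ is the fixed effect of condition, $r_i$ is the random intercept for rater, and $\varepsilon_{ij}$ is the residual error.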
```{r}
#| warning: false
#| message: false
library(lme4)
library(car) # for logit function
mix <- lmer(logit(estimate) ~ condition + (1 | rater), data = sbr2)
# Check model performance
library(performance)
check_normality(mix)
check_heteroscedasticity(mix)
# Check effect of condition
car::Anova(mix)
# Estimate the means for each group
library(emmeans)
em <- emmeans(mix, ~ condition, type ="response")
em
# Contrast the means
pairs(em)
# plot the means with 95% CIs
plot(em) +
coord_flip()+
xlim(0.7,1)+
theme_r4pde()
```
As shown above, we can reject the null hypothesis that the means are the same between the two groups.
Alternatively, we could fit generalized linear mixed models (GLMMs), which extend traditional linear mixed models to accommodate response variables that follow different distributions. They are particularly useful when the response variable does not follow a normal distribution and cannot be adequately transformed to meet the parametric assumptions of traditional linear models. The glmmTMB package in R provides a convenient and flexible platform to fit GLMMs using a variety of distributions [@brooks2017].
In our case, considering that our response variable is bounded between 0 and 1, a Beta distribution might be a suitable choice. The Beta distribution is a continuous probability distribution defined on the interval [0, 1], and it is commonly used for modelling variables that represent proportions or percentages.
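For reference, under the mean-precision parameterization commonly used in beta regression (and, to our knowledge, the one adopted by `beta_family()` in *glmmTMB*), the density is

$$
f(y; \mu, \phi) = \frac{\Gamma(\phi)}{\Gamma(\mu\phi)\,\Gamma\big((1-\mu)\phi\big)}\; y^{\mu\phi - 1}(1 - y)^{(1-\mu)\phi - 1}, \qquad 0 < y < 1,
$$

with mean $\mu$ and variance $\mu(1-\mu)/(1+\phi)$, where $\phi$ is a precision parameter. The mean is modeled on the logit scale as a function of the fixed and random effects.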
The function `glmmTMB()` from the *glmmTMB* package can be used to fit a GLMM with a Beta distribution. In this function, we specify the distribution family as `beta_family()`.
```{r}
#| message: false
#| warning: false
library(glmmTMB)
mix2 <- glmmTMB(estimate ~ condition + (1 | rater),
                data = sbr2,
                family = beta_family())
```
Because the *performance* package does not handle *glmmTMB* output, we will use the *DHARMa* package, which is particularly useful for checking the assumptions of a GLMM fitted with `glmmTMB()`. The package provides a convenient way to carry out residual diagnostics for models fitted via maximum likelihood, including GLMMs. It creates standardized residuals from the observed and predicted responses of the fitted model and compares them to residuals simulated under a correctly specified model.
```{r}
#| message: false
#| warning: false
library(DHARMa)
plot(simulateResiduals(mix2))
```
In this example, `simulateResiduals()` generates simulated residuals from the fitted model, and `plot()` displays the diagnostic plots. These indicate that the scaled residuals are consistent with the uniform distribution expected under a correctly specified model. We can now proceed with the post hoc analysis and notice that the results are similar to those obtained when the response variable was logit-transformed.
```{r}
car::Anova(mix2)
library(emmeans)
em <- emmeans(mix2, ~ condition, type = "response")
em
# Contrast the means
pairs(em)
```