---
title: "Similarities in Multivariate Data Sets"
author: "Mark Newman"
date: "2022-02-13"
editor_options:
  chunk_output_type: console
---
# Intro
Sometimes you need to generate synthetic data.
There is nothing wrong with this.
There are a lot of academic papers that are built on this premise.
The issue is that when this is done, several quasi-experiment objections can be raised.
The most important of which is:
> How do you know the synthetic data is _close enough_ to the real data to draw any valid inference?
This walk-through seeks to provide examples as to why this is important, some things that work, and more importantly things to **NOT DO**.
```{r message = F}
library(ICSNP)      # HotellingsT2()
library(mvnormtest) # mshapiro.test()
library(biotools)   # boxM()
library(dplyr)
library(tidyr)
library(lubridate)
library(ggplot2)
library(knitr)
library(kableExtra)
```
# Data Set
Let's examine a univariate data set first to explain what we mean by similar synthetic data.
Below are two days of **REAL** data.
The data was collected using a [HOBO MX2202](https://www.amazon.com/gp/product/B075X2SWKN) temperature and light data logger (Logger 1) and a [HOBO MX2201](https://www.amazon.com/gp/product/B075X1MPRW) temperature logger (Logger 2).
The data loggers were placed approximately four inches apart on top of a glass mason jar in an indoor facility for the purposes of product evaluation.
```{r}
data <- read.csv('mv-similar.csv')
data <-
  data %>%
  mutate(Date.Time = mdy_hm(Date.Time))
head(data) %>%
  kable(
    caption = 'Evaluation Data',
    col.names = c('Date', 'Temperature', 'Light', 'Temperature')) %>%
  kable_styling() %>%
  add_header_above(header = c(' ' = 1, 'MX2202' = 2, 'MX2201' = 1))
```
Looking at the data, we see that the two `Logger`s are reasonably close when comparing along `Temperature`.
```{r}
data <-
  data %>%
  select(Date.Time, Temperature.1, Temperature.2) %>%
  pivot_longer(cols = starts_with('Temperature'), names_to = 'Logger', values_to = 'Temperature') %>%
  separate(Logger, into = c(NA, 'Logger')) %>%
  mutate(Logger = factor(Logger, levels = 1:2))
ggplot(data, aes(x = Date.Time, y = Temperature, color = Logger)) +
  theme_bw() +
  geom_line()
```
Now, there are differences.
For our purposes, the question is going to be: are there too many differences?
# Analysis
The original question of _close enough_ here is straightforward.
The two loggers were inches apart and the curves look almost identical.
The problem is that looking at something is not a **_test_**.
We need a statistical test to help us determine whether or not there is a significant difference between the two distributions.
This is especially the case where:
1. Hundreds of curves need to be examined
2. The data is multivariate
The first thing that comes to mind is a `t.test()`.
Let's see how that works.
```{r}
t.test(Temperature ~ Logger, data = data)
```
We see that the p-value is 0.3176, which is > 0.05, so we fail to reject $H_0$ and say there is no difference in group means.
There are two issues here:
1. We want to test for equal distributions, not group means
2. We have multivariate data
On the first issue, we may consider this a moot point because our final experiment may be concerned with group means (these data loggers were bought for that purpose).
If this is the case, we may be tempted to just proceed forward.
While not appropriate, it _might_ not impact the overall intent of the experiment.
We _will_ revisit this point later.
However, extreme caution is warranted because of point 2.
When we look at our previous graph we were directly using `Date.Time`.
This might not be obvious at first, but this meant we were explicitly moving our data from univariate to multivariate.
We promptly failed to incorporate this change into our test.
This becomes painfully obvious when we reorder `Temperature` while keeping `Date.Time` in place.
```{r results = 'hold'}
t1 <-
  rbind(
    data %>%
      filter(Logger == '1'),
    data %>%
      filter(Logger == '2') %>%
      # re-order Logger 2's temperatures while leaving Date.Time in place
      mutate(Temperature = sort(Temperature)))
ggplot(t1, aes(x = Date.Time, y = Temperature, color = Logger)) +
  theme_bw() +
  geom_line()
t.test(Temperature ~ Logger, data = t1)
rm(t1)
```
The curves now look very different, yet the test returns the _exact_ same result. This is because `t.test()` only looks at the marginal distribution of `Temperature`, and re-ordering a vector does not change its marginal distribution.
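As a quick sketch (this check is not part of the original walk-through), the mean and standard deviation, which are all `t.test()` uses, are identical before and after sorting:
```{r}
# sketch: sorting changes the order of the values, not their marginal summary
x <- data$Temperature[data$Logger == '2']
c(mean = mean(x) - mean(sort(x)), sd = sd(x) - sd(sort(x)))
rm(x)
```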
The straightforward fix to this is to use the multivariate `HotellingsT2()`.
Let's see how that works.
# Analysis - Multivariate
When running multivariate tests, checking the assumptions is normally important.
However, as this first test is merely a stepping stone, we will be ignoring best practice.
For the `HotellingsT2()` test, there are four assumptions.
1. Data is independent
2. No sub populations
3. The data is multivariate normal
4. Both groups have the same variance-covariance matrix
An astute observer will notice that there is clearly auto-correlation.
If this test were our final step, we would be especially concerned.
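As a quick illustration (not part of the original analysis), the sample auto-correlation function of either logger's temperature makes the serial dependence obvious:
```{r}
# sketch: strong serial dependence in Logger 1's temperature readings
acf(data$Temperature[data$Logger == '1'], main = 'Logger 1 Temperature')
```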
The sub-population assumption can safely be ignored for this example.
None of the data is repeated and the data came from exactly two different devices.
When looking at the multivariate normality results, the auto-correlation can be seen in full effect.
```{r}
mn <- min(data$Date.Time)
t1 <-
  data %>%
  mutate(
    # convert Date.Time into minutes since the start so both variables are numeric
    Minute = difftime(Date.Time, mn, units = "mins"),
    Minute = as.integer(Minute)) %>%
  select(Logger, Temperature, Minute) %>%
  as.data.frame()
# mshapiro.test() expects variables in rows, hence the t()
mshapiro.test(t(t1[t1$Logger == '1', c('Temperature', 'Minute')]))
mshapiro.test(t(t1[t1$Logger == '2', c('Temperature', 'Minute')]))
rm(mn)
```
Normally, we want to see the `mshapiro.test()` p-value > 0.05, so we can fail to reject $H_0$ and say there is no evidence to suggest the data is not multivariate normal.
With our p-values being very small (p < 0.0001), we are forced to say the opposite: there _is_ evidence to suggest the data is not multivariate normal.
While `HotellingsT2()` is known to be robust against violations of this assumption, it is not _that_ robust.
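A common visual companion to `mshapiro.test()` (not used in the original walk-through) is a chi-squared Q-Q plot of the squared Mahalanobis distances; departures from the reference line indicate non-normality:
```{r}
# sketch: squared Mahalanobis distances vs chi-squared quantiles for Logger 1
x <- t1[t1$Logger == '1', c('Temperature', 'Minute')]
d2 <- mahalanobis(x, colMeans(x), cov(x))
qqplot(qchisq(ppoints(length(d2)), df = 2), d2,
       xlab = 'Chi-squared quantiles', ylab = 'Squared Mahalanobis distance')
abline(0, 1)
rm(x, d2)
```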
When looking at the variance-covariance matrix, we find our first "good" assumption.
```{r}
boxM(t1[,c('Temperature', 'Minute')], t1[,'Logger'])
```
Even considering the _significant_ departure from normality found above, our p-value is > 0.05, so we fail to reject $H_0$ and say there is no evidence to suggest `Logger` 1 and 2 have different variance-covariance matrices.
Given our prior commitment to willful ignorance, we continue on with the test.
```{r}
g1 <- t1[t1$Logger == '1', c('Temperature', 'Minute')]
g2 <- t1[t1$Logger == '2', c('Temperature', 'Minute')]
HotellingsT2(g1, g2)
rm(t1, g1, g2)
```
At this point our p-value is > 0.05, so we fail to reject $H_0$ and say there is no evidence to suggest the two loggers have different joint group means.
Let that sink in a minute.
Even though we went through the process, however questionably sound it may be, of using a multivariate test, we still end up in the same place: no difference even when there is clearly a difference.
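If that claim seems surprising, here is a quick sketch (not in the original walk-through) that re-runs `HotellingsT2()` on the re-sorted data. Sorting Logger 2's temperatures changes the joint shape of the data, but not either group's mean vector, so the test again reports no difference.
```{r}
# sketch: Hotelling's T2 on the re-sorted data -- the group mean vectors are
# unchanged by sorting, so the test still sees no difference
mn <- min(data$Date.Time)
t1 <-
  rbind(
    data %>%
      filter(Logger == '1'),
    data %>%
      filter(Logger == '2') %>%
      mutate(Temperature = sort(Temperature))) %>%
  mutate(
    Minute = difftime(Date.Time, mn, units = "mins"),
    Minute = as.integer(Minute)) %>%
  as.data.frame()
HotellingsT2(
  t1[t1$Logger == '1', c('Temperature', 'Minute')],
  t1[t1$Logger == '2', c('Temperature', 'Minute')])
rm(mn, t1)
```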
This leads us back to our first issue: we want to test whether the distributions are equal, not just the group means.
Now that we know that testing group means is not helping, we need to come up with a better plan.
We need to reset back to our similar data and try another approach.
# Analysis - $\chi^2$
One possibility is the `chisq.test()`.
The first application of the `chisq.test()` that may be found in many texts is differences between a known and an observed distribution.
This usually comes in the form of comparing [Weldon's dice](https://en.wikipedia.org/wiki/Raphael_Weldon) to the binomial distribution.
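As a toy illustration (made-up counts, not Weldon's actual data), the textbook use of `chisq.test()` compares observed counts against a known distribution:
```{r}
# toy example: 600 made-up die rolls tested against the uniform expectation
observed <- c(106, 92, 97, 105, 88, 112)
chisq.test(observed, p = rep(1/6, 6))
```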
The issue for this particular data set is that `chisq.test()` works on count data.
Our data is definitively not that.
We can overcome this hurdle by using a bucketing technique.
The main problem with bucketing is selecting the bucket size.
For this data set it is not so bad, but a lot of care needs to be taken here.
We will use `8` buckets for each of the two variables, `Temperature` and `Minute`.
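The bucketing itself is done with `ntile()` from `dplyr`, which assigns (nearly) equal-sized rank buckets; a tiny sketch:
```{r}
# sketch: ntile() splits the values 1 through 10 into 4 rank buckets
ntile(1:10, 4)
```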
```{r results = 'hold'}
b <- 8
mn <- min(data$Date.Time)
t1 <-
  data %>%
  mutate(
    Minute = difftime(Date.Time, mn, units = "mins"),
    Minute = as.integer(Minute)) %>%
  select(Logger, Temperature, Minute) %>%
  mutate(
    TBlock = ntile(Temperature, b),
    MBlock = ntile(Minute, b))
(t1 <- xtabs(~ TBlock + MBlock + Logger, data = t1))
rm(b, mn)
```
Looking at the two matrices, we see the fatal flaw of $\chi^2$: zero entries.
This is expected for this data set because we are looking at a multivariate distribution.
In fact, it is exactly for this reason that we care about the _joint_ `Temperature` x `Minute` distribution.
Sometimes a given `Minute` range simply does not contain every `Temperature` range, and we want that to be taken into account.
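A quick count of the empty cells (not part of the original walk-through) shows just how sparse these tables are:
```{r}
# sketch: fraction of zero cells in each Logger's 8 x 8 joint table
apply(t1 == 0, 3, mean)
```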
We can move past the excessive zero entries by flattening the matrices and removing entries where both are zero.
```{r}
# flatten each Logger's 8 x 8 table and drop cells that are zero in both
g1 <- as.vector(t1[,,1])
g2 <- as.vector(t1[,,2])
indx <- !(g1 == 0 & g2 == 0)
g1 <- g1[indx]
g2 <- g2[indx]
(t1 <-
  c(g1, g2) %>%
  matrix(nrow = 2, byrow = T) %>%
  as.table())
chisq.test(t1, simulate.p.value = T)
rm(t1, g1, g2, indx)
```
At this point our p-value is < 0.05, so we reject $H_0$ and say there is evidence to suggest the counts differ between the two loggers.
This creates a big problem for us: even when comparing two very similar curves, the test says the data are different.
Before giving up on $\chi^2$, two things are worth checking:
1. Carefully manage the bucket sizes
2. Check to see what happens when we resort the data like we did before
Let's try the data re-sorting experiment again to see whether `chisq.test()` rejects $H_0$ in the cases where we expect it to.
```{r}
b <- 8
t2 <-
  rbind(
    data %>%
      filter(Logger == '1'),
    data %>%
      filter(Logger == '2') %>%
      # re-order Logger 2's temperatures, as in the earlier t.test() example
      mutate(Temperature = sort(Temperature))) %>%
  mutate(
    Date = date(Date.Time),
    Minute = difftime(Date.Time, Date, units = "mins"),
    Minute = as.integer(Minute)) %>%
  select(Logger, Temperature, Minute) %>%
  mutate(
    TBlock = ntile(Temperature, b),
    MBlock = ntile(Minute, b))
(t2 <- xtabs(~ TBlock + MBlock + Logger, data = t2))
g1 <- as.vector(t2[,,1])
g2 <- as.vector(t2[,,2])
indx <- !(g1 == 0 & g2 == 0)
g1 <- g1[indx]
g2 <- g2[indx]
(t2 <-
  c(g1, g2) %>%
  matrix(nrow = 2, byrow = T) %>%
  as.table())
chisq.test(t2)
```