---
title: "Stepwise Regression with Negatively Correlated Covariates"
author: "Riley Ashton"
bibliography: report.bib
header-includes:
- \usepackage{listings}
- \usepackage{color}
- \usepackage{mathtools}
output:
pdf_document:
highlight: pygments
toc: true
toc_depth: 2
number_sections: true
html_document:
toc: true
theme: spacelab
df_print: paged
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
library(SelectionSimulator)
library(ggplot2); theme_set(theme_minimal())
library(dplyr)
```
\newpage
# Introduction
David Hamilton's 1987 paper [@hamilton1987sometimes] showed that correlated variables are not always
redundant and that $R^2 > r^2_{yx_1} + r^2_{yx_2}$ is possible. Section 3 of Hamilton's paper
gives an extreme example in which two variables are negatively correlated
and neither explains the response well on its own,
yet together they explain nearly all of the variance ($R^2 \approx 1$).
Forward selection is a greedy algorithm that examines only one variable at a
time. With two highly negatively correlated variables, it is possible that
neither variable is added, since neither may be significant on its own.
This report discusses and tests algorithms that consider adding highly negatively
correlated variables together, along with any variables correlated with those.
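As a rough illustration of Hamilton's phenomenon (a minimal sketch, not part of the report's simulation code; all names here are hypothetical), two strongly negatively correlated covariates can each be individually weak yet jointly explain almost all of the variance:
```{r hamilton_sketch, eval=FALSE, echo=TRUE}
# Minimal sketch of Hamilton's example: neither covariate explains much on
# its own, but together they give R^2 close to 1. Not the report's code.
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- -0.9 * x1 + rnorm(n, sd = sqrt(1 - 0.9^2))  # cor(x1, x2) is about -0.9
y  <- x1 + x2 + rnorm(n, sd = 0.1)

summary(lm(y ~ x1))$r.squared       # small
summary(lm(y ~ x2))$r.squared       # small
summary(lm(y ~ x1 + x2))$r.squared  # close to 1
```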
## Notation used
- p: number of covariates/predictors
- q: number of covariates/predictors with nonzero coefficients (i.e. covariates in the true model)
- n: number of observations
- [AIC: Akaike information criterion](https://en.wikipedia.org/wiki/Akaike_information_criterion)
- [BIC: Bayesian information criterion](https://en.wikipedia.org/wiki/Bayesian_information_criterion)
- Step: name used for [traditional stepwise regression](https://en.wikipedia.org/wiki/Stepwise_regression) in this report
- $\mathcal{O}$ refers to [Big O notation](https://en.wikipedia.org/wiki/Big_O_notation).
Formally $f(n) = \mathcal{O}(g(n))$ is defined as
$$\exists k > 0 \, , \exists n_0 \, , \forall n > n_0 \, , |f(n)| \leq k \cdot g(n)$$
\newpage
# Algorithm
## Step2
### Algorithm Idea
The first algorithm is called Step2. The purpose of Step2 is to improve upon the
traditional stepwise algorithm (in this report called Step) by examining
the correlation between covariates.
For highly negatively correlated covariate pairs, Step2 considers
the traditional choices of adding a single variable,
but also the option of adding both negatively correlated
covariates to the model together (referred to as a block). What constitutes a highly
negatively correlated pair is determined by the parameter `cor_cutoff`.
A cutoff of -0.5 means that
all covariate pairs with a correlation of less than -0.5 (e.g. -0.7) are
additionally considered together. They are also considered as singles,
as is every other variable not yet included in the model. In this report a
`cor_cutoff` of -0.5 is used. This choice of -0.5 was arbitrary.
While both Step and Step2 are greedy algorithms, Step2 looks two steps
ahead at highly negatively correlated covariate pairs.
AIC or BIC is used as the metric for deciding which of the candidate models
is best at each step. This is controlled by the parameter `k`:
`k` = 2 means AIC is used and `k` = $\log_e(n)$ means BIC is used.
In this report $k = \log_e(n)$, i.e. BIC, is used.
\newpage
### Pseudocode
Note this pseudocode uses the R notion of formulas, which are manipulated
similar to strings. In other languages a list or set of data matrix indices may
be stored and appended to and removed from.
```{r, eval=FALSE, echo=TRUE}
let pairs = covariate pairs (x,y) such that cor(x,y) < cor_cutoff
let current_info_crit = information criterion of starting formula
let current_formula = starting formula
let next_formulas = empty list
let past_formulas = empty lookup container
loop
  if(direction is "forward" or "both",
     and the number of observations > the number of covariates in current_formula)
    append (current_formula + y) to next_formulas where y is
      a covariate not yet in current_formula
    append (current_formula + x + y) to next_formulas where (x,y) is in
      pairs and neither x nor y already appears in current_formula
  end if
  if(direction is "backwards" or "both"
     and the number of observations > the number of covariates in current_formula)
    append (current_formula - y) to next_formulas where y is
      a covariate already in current_formula
    append (current_formula - x - y) to next_formulas where (x,y) is in
      pairs and both x and y already appear in current_formula
  end if
  if(length of next_formulas is 0)
    return the model corresponding to current_formula
  end if
  let X = fit a glm to the data for each formula in next_formulas
  let min_info_crit = the minimum information criterion over the models in X
  let min_formula = the formula corresponding to the model with min_info_crit
  if(min_formula is in past_formulas)
    a cycle is present, return an error
  else if(min_info_crit <= current_info_crit)
    add current_formula to past_formulas
    current_info_crit = min_info_crit
    current_formula = min_formula
    next_formulas = empty list
  else
    return the model corresponding to current_formula
  end if
end loop
```
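As a point of reference for the first pseudocode step, the `pairs` set could be computed from the covariate matrix roughly as follows (a sketch with hypothetical names, not the SelectionSimulator internals):
```{r pairs_sketch, eval=FALSE, echo=TRUE}
# Collect the covariate pairs whose sample correlation falls below cor_cutoff.
negatively_correlated_pairs <- function(X, cor_cutoff = -0.5) {
  r <- cor(X)
  idx <- which(r < cor_cutoff & upper.tri(r), arr.ind = TRUE)
  # one character pair per qualifying entry, e.g. list(c("X1", "X2"), ...)
  lapply(seq_len(nrow(idx)), function(i) colnames(r)[idx[i, ]])
}
```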
### Time and Space Complexity vs Step
The time complexity of the algorithm depends heavily on the choice of
`cor_cutoff`. The tradeoff is between a potentially more accurate model
and a longer run time.
The returns in model accuracy from moving `cor_cutoff` closer to zero diminish
rapidly: Step rarely has a problem with only slightly negatively correlated variables,
so there is little for Step2 to improve on in those situations.
The time complexity of Step2 is generally within a constant factor of Step.
At worst it will be on the order of `p` times greater, where `p` is the number of covariates,
since Step chooses from at most `p` models to fit at each stage,
whereas Step2 chooses from at most `p` singles plus $\binom{p}{2}$ pairs.
The space complexity is largely unchanged from the traditional algorithm,
since only the AIC or BIC of each model is stored.
## Step3
### Algorithm Idea
Step3 does everything that Step2 does, but it also recurses on the
pairwise negatively correlated covariates, i.e. the blocks of Step2,
considering anything highly correlated with them, then anything highly
correlated with those, and so on, until it reaches a block size of `max_block_size`.
Step3 with a `max_block_size` of 2 is therefore equivalent to Step2.
The `max_block_size` used in this report is 3, so the algorithm will,
at most, consider including three covariates at a time. This choice was arbitrary.
What is classified as a highly correlated variable for the purposes of recursion
is set by the two parameters `recursive_cor_positive_cutoff` and
`recursive_cor_negative_cutoff`. In this report 0.5 and -0.5 were used for
`recursive_cor_positive_cutoff` and
`recursive_cor_negative_cutoff`, respectively. These choices were arbitrary.
- Let $\land$ represent logical $\text{AND}$
- Let $\lor$ represent logical $\text{OR}$
- Let $\text{CC}$ be a short-form for `cor_cutoff`
- Let $\text{RCN}$ be a short-form for `recursive_cor_negative_cutoff`
- Let $\text{RCP}$ be a short-form for `recursive_cor_positive_cutoff`
- Let $r(x_1,x_2)$ denote the sample correlation between $x_1$ and $x_2$
- Let $S_1$ denote the set of covariates
- Then $S_2$ denoting the set of highly negatively correlated pairs $(x_1,x_2)$ is defined as
$$S_2 = \{(x_1, x_2) \, | \, x_1 \in S_1 \land x_2 \in S_1 \land x_1 \neq x_2 \land r(x_1,x_2) < \text{CC} \}$$
- Then $S_3$ denoting the set of triples $(x_1,x_2,x_3)$ is defined as
$$S_3 = \{ (x_1, x_2, x_3) \, | \, (x_1,x_2) \in S_2 \, \land x_3 \, \in S_1 \, \land \, x_3 \notin (x_1, x_2) \land \left(\text{block } x_1 \, x_2 \, x_3 \right) \}$$
where $$\text{block } x \, y \, z = (r(x,z) < \text{RCN}) \lor (r(x,z) > \text{RCP}) \lor (r(y,z) < \text{RCN}) \lor (r(y,z) > \text{RCP})$$
- Likewise this pattern continues
- Then, in the general case, $S_n$, denoting the set of length-$n$ tuples $(x_1, x_2, \dots, x_n)$, is defined as
$$S_n = \{ (x_1, \dots, x_n) \, \, | \, \, (x_1, \dots, x_{n-1}) \in S_{n-1} \, \land \, x_n \in S_1 \, \land \, x_n \notin (x_1, \dots, x_{n-1}) \, \land \, \left(\text{block2 } (x_1, \dots, x_{n-1}) \, x_n \right) \}$$
where $$\text{block2 } D \, c = \exists a, b \in D \, | \, a \neq b \land \text{block } a \, b \, c $$
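The construction of $S_2, S_3, \dots$ could be sketched in R as follows (an illustrative sketch with hypothetical names, not the package's actual implementation; blocks reached in a different order are not de-duplicated here):
```{r blocks_sketch, eval=FALSE, echo=TRUE}
# Grow blocks from negatively correlated pairs (S_2) by repeatedly attaching
# any covariate highly correlated (either sign) with a current block member.
grow_blocks <- function(r, cor_cutoff = -0.5, rc_neg = -0.5, rc_pos = 0.5,
                        max_block_size = 3) {
  vars <- colnames(r)
  idx <- which(r < cor_cutoff & upper.tri(r), arr.ind = TRUE)
  blocks <- lapply(seq_len(nrow(idx)), function(i) vars[idx[i, ]])  # S_2
  out <- blocks
  size <- 2
  while (size < max_block_size && length(blocks) > 0) {
    grown <- list()
    for (b in blocks) {
      for (v in setdiff(vars, b)) {
        if (any(r[b, v] < rc_neg | r[b, v] > rc_pos)) {
          grown[[length(grown) + 1]] <- c(b, v)
        }
      }
    }
    out <- c(out, grown)   # append S_{size + 1}
    blocks <- grown
    size <- size + 1
  }
  out
}
```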
\newpage
### Pseudocode Changes from Step2
The core of the algorithm is the same as Step2, except for how blocks are computed
and how they are added to or removed from the current formula, because pairs
are now generalized to lists (blocks) of size $\geq 2$. When adding or
removing a block, Step3 appends all of the block's covariates to
the current formula if none of them are currently in it, and it
removes all of them if every one is currently in it.
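For instance, adding or removing a whole block can be expressed with R's `update()` on formulas (a sketch of the idea with hypothetical variable names, not the package code):
```{r formula_update_sketch, eval=FALSE, echo=TRUE}
# Add or remove an entire block of covariates in a single step.
f <- y ~ X4
update(f, . ~ . + X1 + X2 + X3)   # y ~ X4 + X1 + X2 + X3
g <- y ~ X1 + X2 + X3 + X4
update(g, . ~ . - X1 - X2 - X3)   # y ~ X4
```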
### Time and Space Complexity vs Step
With a poor choice of cutoff parameters (e.g. cutoffs so loose that every pair
with correlation < 1 or > -1 qualifies)
and unlimited recursion depth, Step3 will perform all-subsets regression.
All-subsets regression has time complexity exponential in the number of covariates.
The worst-case space and time complexity for a `max_block_size` of $\text{mbs}$ is
$\mathcal O\left( p ^ {(\text{mbs}-1)}\right)$ times that of Step.
With proper choices of cutoffs the time and space complexity is expected
to be within a constant factor of Step.
Note that the expected proportional increase in running time can be computed,
for any given data set and algorithm parameters, before running any simulations.
A possible future improvement could involve the algorithm guaranteeing
similar performance to Step by tuning its parameters for a given data set.
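For example, the extra per-step work implied by a choice of cutoffs could be gauged in advance by counting the candidate blocks for the data at hand (a sketch reusing the hypothetical `grow_blocks()` above; not the package code):
```{r runtime_sketch, eval=FALSE, echo=TRUE}
# X is the candidate covariate matrix; Step fits at most p models per step,
# while Step3 fits at most p singles plus one model per candidate block.
r <- cor(X)
p <- ncol(X)
blocks <- grow_blocks(r, cor_cutoff = -0.5, rc_neg = -0.5, rc_pos = 0.5,
                      max_block_size = 3)
(p + length(blocks)) / p   # rough per-step cost multiplier relative to Step
```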
\newpage
# Simulations
## Linear model
- $\mathbf Y = \beta_0 + \beta_1 \mathbf X_1 + \beta_2 \mathbf X_2 + \cdots + \epsilon$
- $\mathbf X_i$ are correlated, centred random normal variables
- Intercept ($\beta_0 = 9$)
- $\epsilon \sim \mathcal N(\mu = 0, \sigma = 1)$
- 1000 Simulations
## Generating Correlated Centred Random Normal Variables
The variables are generated according to the formula
$$\mathbf X = \mathbf{C Z}$$
This generates a vector of correlated random normal variables
$\mathbf X \sim \mathcal N(\boldsymbol \mu = \mathbf 0, \mathbf \Sigma)$,
where $\mathbf C$ is a $p \times p$ matrix that can be solved for
from the given covariance matrix, $\boldsymbol \Sigma$, such that
$$\boldsymbol \Sigma = \mathbf {C C^T} =
\begin{bmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1p} \\
\sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{1p} & \sigma_{2p} & \cdots & \sigma_p^2
\end{bmatrix}$$
$\mathbf C$ is found using the [Cholesky decomposition](https://en.wikipedia.org/wiki/Cholesky_decomposition),
and $\mathbf Z$ is a vector of independent standard normal variables
$$\mathbf Z = \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_p \end{bmatrix}$$
with $Z_i \sim \mathcal N(\mu = 0, \sigma = 1)$.
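A minimal sketch of this construction in R, using the Cholesky factor returned by `chol()` (illustrative only, not the SelectionSimulator code):
```{r cholesky_sketch, eval=FALSE, echo=TRUE}
# Generate n draws of X = C Z, where C C^T = Sigma.
Sigma <- matrix(c(   1, -0.8,
                  -0.8,    1), nrow = 2)
C <- t(chol(Sigma))   # chol() returns the upper factor U with U^T U = Sigma
n <- 100
Z <- matrix(rnorm(2 * n), nrow = 2)   # each column is one draw of Z
X <- t(C %*% Z)                       # n x 2 matrix of correlated normals
round(cor(X), 2)                      # close to Sigma for large n
```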
## Model 1 : Two Negatively Correlated
This model is designed to show the advantage of Step2 over Step.
$$ \boldsymbol{\Sigma} =\begin{bmatrix}
1 &-0.8 & 0 & 0 & 0 & 0 \\
-0.8& 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix} \hspace{10pt}
\hspace{20pt}
\boldsymbol{\beta} = \begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}$$
$n$ = 100
## Model 2: Three Negatively Correlated
This model is designed to show the advantage of Step3 over Step2 and Step.
$$\boldsymbol{\Sigma} = \begin{bmatrix}
1 & -0.8 & 0.25 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
-0.8 & 1 & -0.75 & 0 & 0& 0 & 0 & 0 & 0 & 0 \\
0.25 & -0.75 & 1 & 0 & 0& 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0& 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
\end{bmatrix}
\hspace{10pt}
\hspace{20pt}
\boldsymbol{\beta} =
\begin{bmatrix} 1 \\ 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0
\end{bmatrix}$$
$n$ = 100
## Model 3 : Big p
This model is designed to show the advantages of Step2 and Step3 over Step
when $p \gg n$. Note that the covariates V6 through V100 are independent and do not
contribute to the response ($\beta_i = 0$ for $i \in [6..100]$).
$$ \boldsymbol \Sigma = \begin{bmatrix}
1 & -0.8 & 0.25 & 0 & 0 & 0 &\cdots & 0\\
-0.8& 1 & -0.5 & 0 & 0 & 0 & \cdots & 0 \\
0.25& -0.5 & 1 & 0 & 0 & 0 & \cdots & 0\\
0 & 0 & 0 & 1 & -0.75 & 0 & \cdots & 0\\
0 & 0 & 0 & -0.75 & 1 & 0 & \cdots & 0\\
0 & 0 & 0 & 0 & 0 & 1 &\cdots & 0\\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & 0 & 0 & 0 & \cdots & 1
\end{bmatrix}
\hspace{10pt} \hspace{20pt}
\boldsymbol \beta =
\begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 0 \\ \vdots \\ 0 \\ \end{bmatrix}
$$
$n$ = 50
```{r models}
twoNeg <- readRDS("./simulations/twoNeg.RDS")
threeNeg <- readRDS("./simulations/threeNeg.RDS")
big_p <- readRDS("./simulations/big_p.RDS")
```
\newpage
# Results
## Model 1 : Two Negatively Correlated
In this case Step3's recursion did not add any covariates
to its block inclusions, so it was identical to Step2.
They were identical because only two
covariates were highly negatively correlated ($X_1$ and $X_2$) and
neither $X_1$ nor $X_2$ was highly correlated with any other covariate.
This meant that Step2 and Step3 considered including $X_1$ and $X_2$ in the
same step, but did not consider adding any other variables in the
same step.
Both Step2 and Step3 outperformed Step, since Step often ($\approx 30\%$)
selected no covariates at all (i.e. the intercept-only model).
This is due to the highly negative sample correlation ($r \approx -0.8$)
between $X_1$ and $X_2$.
Since Step failed to include $X_1$ and $X_2$ about 40% of the time,
while Step2 and Step3 included them nearly all the time,
the variance of the coefficient estimates for $X_1$ and $X_2$ was
much higher in the Step simulations than in Step2 or Step3.
```{r two_neg_plot_1}
betas_heat_map(twoNeg)
```
Here the heat map for the first 100 simulations and the values of the model
coefficients are shown. Note that NA values are shown as black, meaning
that coefficient was not selected in that simulation. Step often misses both
X1 and X2, while Step2 and Step3 nearly always include both of these variables.
******
```{r}
proportion_included(twoNeg)
```
Table 1 shows the proportion of times that each algorithm selected
a given covariate. Step failed to select X1 or X2 over a third of the time.
Step2 and Step3 nearly always selected X1 and X2.
[In the appendix is the proportion of times that each algorithm selected the correct model.](#model-1)
```{r}
coeff_bias(twoNeg)
coeff_variance(twoNeg)
```
The bias of the fitted coefficients, $\text{bias}_{\beta_j} = \left(\frac{1}{N} \sum_{i=1}^N \beta_{ij} \right) - \left( \frac{1}{N} \sum_{i=1}^N \hat \beta_{ij} \right)$ where $N$ is the number of simulations, shown in Table 2,
is greatly reduced for X1 and X2 under Step2 and Step3 compared to Step.
The variance of the fitted coefficients for X1 and X2, shown in Table 3, is also reduced, since Step included
X1 and X2 only about 60% of the time.
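These summaries could be computed along the following lines (a sketch; `beta_hat`, `true_beta`, and the treatment of unselected coefficients as 0 are assumptions for illustration, not necessarily what SelectionSimulator does):
```{r bias_sketch, eval=FALSE, echo=TRUE}
# beta_hat: n_sim x p matrix of fitted coefficients, NA when not selected;
# true_beta: length-p vector of true coefficients.
beta0 <- ifelse(is.na(beta_hat), 0, beta_hat)   # treat unselected as 0
bias  <- true_beta - colMeans(beta0)            # report's sign convention
vars  <- apply(beta0, 2, var)                   # variance per coefficient
```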
******
```{r two_neg_coeff}
inclusion_order(twoNeg, pdf_head = 5)
```
Table 4 shows the 5 most commonly occurring orders in which variables were included
by each stepwise algorithm.
[The full tables are available in the appendix.](#model-1)
Note that the character | delineates the steps of the selection algorithm and
|| denotes the termination of the selection algorithm.
For example, || alone denotes that the algorithm terminated without selecting any variables,
while X1X2 | X3 || means the algorithm selected both X1 and X2 in the first step,
X3 in the second step, and then terminated.
Step2 and Step3 are identical, as explained at the start of this section.
Step did not select any variables over a quarter of the time. In nearly all cases,
Step2 and Step3 selected X1 and X2 together. It is this ability to include highly
negatively correlated covariates together that makes Step2 and Step3 well suited
to cases with high pairwise negative correlations.
```{r eval=FALSE, include=FALSE}
# Must be manually updated if simulation changes
sample_cor_matrix(twoNeg)
```
$$\begin{bmatrix}
X & -0.89/-0.64 & -0.3/0.33 & -0.31/0.28 & -0.29/0.33 & -0.32/0.33 \\
-0.89/-0.64 & X & -0.26/0.3 & -0.35/0.37 & -0.31/0.28 & -0.27/0.3 \\
-0.3/0.33 & -0.26/0.3 & X & -0.31/0.27 & -0.29/0.28 & -0.31/0.33 \\
-0.31/0.28 & -0.35/0.37 & -0.31/0.27 & X & -0.29/0.3 & -0.27/0.33 \\
-0.29/0.33 & -0.31/0.28 & -0.29/0.28 & -0.29/0.3 & X & -0.32/0.32 \\
-0.32/0.33 & -0.27/0.3 & -0.31/0.33 & -0.27/0.33 & -0.32/0.32 & X
\end{bmatrix}$$
The above matrix shows, over all simulations, the minimum and maximum sample pairwise correlation for each covariate pair.
The first off-diagonal entry of $-0.89/-0.64$ means that the minimum
sample pairwise correlation between $X1$ and $X2$ was $-0.89$, while the maximum
was $-0.64$. This shows that, in every simulation,
Step2 and Step3 had the option of including $X1$ and
$X2$ together, but did not have the option to do so for any other pair of variables.
******
```{r two_neg_tables}
test_sse2kable(twoNeg)
training_sse2kable(twoNeg)
```
``` {r echo=FALSE}
X_test <- test_sse2dataframe(twoNeg)
reduction_test <- (X_test["Step", "Mean"] - X_test["Step2", "Mean"]) / X_test["Step", "Mean"]
reduction_test <- round(reduction_test, digits = 3)
X_train <- training_sse2dataframe(twoNeg)
reduction_train <- (X_train["Step", "Mean"] - X_train["Step2", "Mean"]) / X_train["Step", "Mean"]
reduction_train <- round(reduction_train, digits = 3)
```
Table 5 and Table 6 show the test and training SSE from each algorithm.
Step2 and Step3 are, again, identical. Step2 and Step3 reduced the training
SSE and test SSE compared to Step by `r paste0(100 * reduction_train, "%")`
and `r paste0(100 * reduction_test, "%")`, respectively.
[Boxplots of test and training SSE are located in the appendix](#model-1)
\newpage
## Model 2 : Three Negatively Correlated
In addition to examining singles, as Step does, Step2 and Step3 examine pairs drawn from X1, X2 and X3.
Step3 additionally considers adding X1, X2 and X3 all together.
This results in Step2 performing better than Step and Step3 performing
better than Step2.
```{r}
betas_heat_map(threeNeg)
```
Examining the heat map of fitted coefficients, all algorithms have difficulty
selecting all the correct covariates. This is an especially difficult
correlation structure. However, Step3 does a considerably better job of selecting
all of X1, X2 and X3 than Step and Step2.
******
```{r}
proportion_included(threeNeg)
```
Table 7 shows the proportion of times that each algorithm selected
a given covariate. Step performed very poorly at selecting
X1 and X3, choosing them about a fifth and a third of the time, respectively.
Step2 offered slight improvements for these two variables, but Step3
showed a marked improvement in selecting both X1 and X3 compared to Step or Step2.
It is also worth noting that Step2 and Step3 performed better than Step in selecting X2.
This suggests that X2 was not always significant by itself.
[In the appendix is the proportion of times that each algorithm selected the correct model.](#model-2)
```{r}
coeff_bias(threeNeg)
coeff_variance(threeNeg)
```
Step2 reduces the bias of X1, X2 and X3 (Table 8), but suffers greater fitted coefficient variance
compared to Step (Table 9). Step3 further reduces the bias of X1, X2 and X3,
but suffers greater fitted coefficient variance
compared to Step2. The increase in variance is most likely due to the fact that
the proportion of simulations in which X1 and X3 are selected approaches 50% with Step3 (Table 7).
Additionally, the proportion of simulations in which Step3 selected at least the correct covariates (Table 21),
or exactly the correct model (Table 20), is far closer to 50% than it is for Step.
[These two tables, Table 20 and Table 21, are in the appendix.](#model-2)
******
\newpage
```{r}
inclusion_order(threeNeg)
```
In Table 10, Step3 shows its recursive ability by including X1, X2 and X3 together
in a single step.
$$\begin{psmallmatrix}
X & -0.89/-0.61 & -0.05/0.5 & -0.31/0.35 & -0.32/0.33 & -0.31/0.33 & -0.32/0.26 & -0.33/0.29 & -0.32/0.32 & -0.34/0.32 \\
-0.89/-0.61 & X & -0.85/-0.58 & -0.38/0.37 & -0.36/0.34 & -0.3/0.33 & -0.28/0.25 & -0.27/0.33 & -0.35/0.31 & -0.32/0.3 \\
-0.05/0.5 & -0.85/-0.58 & X & -0.34/0.32 & -0.38/0.31 & -0.3/0.32 & -0.34/0.32 & -0.3/0.32 & -0.3/0.28 & -0.36/0.27 \\
-0.31/0.35 & -0.38/0.37 & -0.34/0.32 & X & -0.35/0.29 & -0.26/0.35 & -0.32/0.27 & -0.29/0.35 & -0.31/0.36 & -0.31/0.29 \\
-0.32/0.33 & -0.36/0.34 & -0.38/0.31 & -0.35/0.29 & X & -0.28/0.31 & -0.32/0.32 & -0.29/0.31 & -0.34/0.36 & -0.34/0.31 \\
-0.31/0.33 & -0.3/0.33 & -0.3/0.32 & -0.26/0.35 & -0.28/0.31 & X & -0.31/0.36 & -0.35/0.27 & -0.31/0.28 & -0.3/0.31 \\
-0.32/0.26 & -0.28/0.25 & -0.34/0.32 & -0.32/0.27 & -0.32/0.32 & -0.31/0.36 & X & -0.31/0.27 & -0.28/0.29 & -0.3/0.29 \\
-0.33/0.29 & -0.27/0.33 & -0.3/0.32 & -0.29/0.35 & -0.29/0.31 & -0.35/0.27 & -0.31/0.27 & X & -0.34/0.34 & -0.31/0.27 \\
-0.32/0.32 & -0.35/0.31 & -0.3/0.28 & -0.31/0.36 & -0.34/0.36 & -0.31/0.28 & -0.28/0.29 & -0.34/0.34 & X & -0.31/0.29 \\
-0.34/0.32 & -0.32/0.3 & -0.36/0.27 & -0.31/0.29 & -0.34/0.31 & -0.3/0.31 & -0.3/0.29 & -0.31/0.27 & -0.31/0.29 & X
\end{psmallmatrix}$$
```{r eval=FALSE, include=FALSE}
# Must be manually updated if simulation changes
sample_cor_matrix(threeNeg)
```
The sample correlation matrix shown above confirms that, in every simulation,
Step3 had the option to include X1, X2 and X3 together.
******
\newpage
```{r threeNeg_tables}
test_sse2kable(threeNeg)
training_sse2kable(threeNeg)
```
``` {r echo=FALSE}
X_test <- test_sse2dataframe(threeNeg)
reduction_test <- (X_test["Step", "Mean"] - X_test["Step3", "Mean"]) / X_test["Step", "Mean"]
reduction_test <- round(reduction_test, digits = 3)
X_train <- training_sse2dataframe(threeNeg)
reduction_train <- (X_train["Step", "Mean"] - X_train["Step3", "Mean"]) / X_train["Step", "Mean"]
reduction_train <- round(reduction_train, digits = 3)
```
As evidenced by Table 11 and Table 12, Step2 and Step had similar performance.
Step2 had only slightly better test and training SSE than Step.
The similar performance, between Step and Step2, is most likely due to the
complex correlation structure of the covariates. Step2 is only designed to handle
high levels of correlation between 2 covariates at a time, but the correlation
structure here involves 3 covariates (X1, X2 and X3), so Step2's performance is less
than optimal. Step3, being designed to handle complex correlation structures
with its recursive ability, improved upon the performance of Step and Step2.
Step3 reduced the training
SSE and test SSE compared to Step by `r paste0(100 * reduction_train, "%")`
and `r paste0(100 * reduction_test, "%")`, respectively.
[Boxplots of test and training SSE are located in the appendix](#model-2)
\newpage
## Model 3 : Big p
In this simulation $p \gg n$, and the curse of dimensionality is evident in the results.
```{r}
only_correct_predictors_included(big_p)
```
There was so much noise that Step2 and Step3 selected the
correct model only once and Step never selected it (Table 13).
```{r}
at_least_correct_predictors_included(big_p)
```
When judging algorithms on including at least the covariates of the correct model,
Step3 did best, followed by Step2 and then Step (Table 14).
******
```{r}
test_sse2kable(big_p)
training_sse2kable(big_p)
```
Tables 15 and 16 show the test and training SSE results.
The median value of 0 for training SSE is not surprising, since it is
easy to fit a perfect training model when $p \gg n$.
``` {r echo=FALSE}
X_test <- test_sse2dataframe(big_p)
reduction_test <- (X_test["Step", "Mean"] - X_test["Step2", "Mean"]) / X_test["Step", "Mean"]
reduction_test <- round(reduction_test, digits = 3)
X_train <- training_sse2dataframe(big_p)
reduction_train <- (X_train["Step", "Mean"] - X_train["Step2", "Mean"]) / X_train["Step", "Mean"]
reduction_train <- round(reduction_train, digits = 3)
```
Step2 reduced the training
SSE and test SSE compared to Step by `r paste0(100 * reduction_train, "%")`
and `r paste0(100 * reduction_test, "%")`, respectively.
``` {r echo=FALSE}
X_test <- test_sse2dataframe(big_p)
reduction_test <- (X_test["Step", "Mean"] - X_test["Step3", "Mean"]) / X_test["Step", "Mean"]
reduction_test <- round(reduction_test, digits = 3)
X_train <- training_sse2dataframe(big_p)
reduction_train <- (X_train["Step", "Mean"] - X_train["Step3", "Mean"]) / X_train["Step", "Mean"]
reduction_train <- round(reduction_train, digits = 3)
```
Step3 reduced the training
SSE and test SSE compared to Step by `r paste0(100 * reduction_train, "%")`
and `r paste0(100 * reduction_test, "%")`, respectively.
[Boxplots of test and training SSE are located in the appendix](#model-3)
# Conclusion
In the case of highly negatively correlated covariates, Step2 and Step3 offer
improvements in model fit over the traditional stepwise algorithm.
They can suffer from longer runtimes, especially when their algorithm parameters
are chosen without regard for runtime.
\newpage
# Appendix
## Future Inquiries
- A general recommendation for setting the algorithm parameters of Step2 or Step3
- Using cross-validation to set the algorithm parameters
- Avoiding algorithm parameter choices that cause the asymptotic runtime to
become greater than that of the traditional stepwise selection algorithm
## Additional Tables & Plots
### Model 1
```{r}
only_correct_predictors_included(twoNeg)
```
```{r}
at_least_correct_predictors_included(twoNeg)
```
*******
\newpage
```{r}
inclusion_order(twoNeg, pdf_head = 100)
```
******
\newpage
```{r}
training_sse_boxplot(twoNeg)
```
******
\newpage
```{r}
test_sse_boxplot(twoNeg)
```
******
\newpage
### Model 2
```{r}
only_correct_predictors_included(threeNeg)
```
```{r}
at_least_correct_predictors_included(threeNeg)
```
```{r}
inclusion_order(threeNeg, pdf_head = 46)
```
*****
\newpage
```{r}
training_sse_boxplot(threeNeg)
```
*****
\newpage
```{r}
test_sse_boxplot(threeNeg)
```
*****
\newpage
### Model 3
```{r}
training_sse_boxplot(big_p)
```
*****
\newpage
```{r}
test_sse_boxplot(big_p)
```
*****
\newpage
## Algorithm Source Code URL
- View source code at [https://github.com/riley-ashton/Selection/tree/master/R](https://github.com/riley-ashton/Selection/tree/master/R)
- View report code at [https://github.com/riley-ashton/Selection-Report](https://github.com/riley-ashton/Selection-Report)
# References