-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathGeyser-Analysis.tex
720 lines (584 loc) · 31.4 KB
/
Geyser-Analysis.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
% !TeX program = XeLaTeX
\documentclass[english,final]{scrartcl}
\usepackage{Sweave}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{amsthm}
\usepackage{mathtools}
\usepackage{babel}
\usepackage[english=british]{csquotes}
\usepackage[backend=biber]{biblatex}
\addbibresource{biblio.bib}
\usepackage{fontspec}
\setmainfont{Linux Libertine O}
\setsansfont{Linux Biolinum O}
\usepackage{microtype}
\usepackage{siunitx}
\usepackage{booktabs}
\usepackage[colorlinks=true, linkcolor=blue, pdfborder={0 0 0}]{hyperref}
\usepackage{todonotes}
\usepackage[left]{showlabels}
\title{An analysis of the Old Faithful Geyser}
\subtitle{End of term report for the lecture\\Computational Statistics}
\author{Tim Adler}
\begin{document}
\input{Geyser-Analysis-concordance}
\maketitle
%\listoftodos
%!Rnw root = Geyser-Analysis.Rnw
%!TeX root = Geyser-Analysis.Rnw
% concordance = no idea
% echo: Do not automatically print entered R code
% Load important libraries
% Setup variables
% Setup the geyser data structure and colors
% Loss function used
% Official Prediction Function
% Often used plots
% Fit function using the above defined loss function
% Internally uses the R optim method
% Furthermore a helper method that generates for a function fn(x,p) the
% counterpart to minimize over.
% Computes the prediction error of our model using cross validation
% and a special loss function.
% Think about adding error handling.
% Computes the average misses with this predictive function fn.
% Does not use predictive error, however. Perhaps change later
% If the predictive function predicts too short, how much longer does
% one have to wait on average?
% Creates a logistic function that has the two asymptotic levels as
% parameters. The "velocity" and "location" of switch will be fixed.
%!Rnw root = Geyser-Analysis.Rnw
%!TeX root = Geyser-Analysis.Rnw
\section{Introduction}
\label{sec:introduction}
In this report we want to analyse the eruptions of the Old Faithful Geyser. This is a geyser at the Yellowstone National Park, which is known for its predictability, hence the name 'faithful'.
We will use the data used by Azzalini \& Bowman \cite{data}, which was measured from August 1, 1985 to August 15, 1985. The measurements were taken manually an example page of the notes can be found in \cite[2]{data}.
The data set consists of two variables with overall 299 measurements. The types of the variables can be found in Table~\ref{tab:str}. For details please consult Appendix~\ref{sec:summary}.
\begin{table}[htbp]
\centering
\begin{tabular}{rrrr}
\toprule
Variable & Name & Type & Comment\\
\midrule
1 & duration & numeric & Eruption duration in \si{\minute}\\
2 & waiting & numeric & Waiting time since last \\
& & & eruption in \si{\minute}\\
\bottomrule
\end{tabular}
\caption{Structure of the geyser data set}
\label{tab:str}
\end{table}
From the handwritten notes, we take that not all duration measurements were done thoroughly at night times only a classification was given (\enquote{short}, \enquote{medium}, \enquote{long}). Azzalini \& Bowman chose to translate those to \SI{2}{\minute}, \SI{3}{\minute} and \SI{4}{\minute}. In this analysis we use an implementation of the geyser data set which is given by the R framwork. Explicitely we use R version 3.2.3 (2015-12-10) on a x86-64-pc-linux-gnu machine and the geyser data set given by the \texttt{MASS} library (Version 7.3-45). In this implementation the translation is already incorporated. Figure~\ref{fig:cluster} shows lines at the appropriate durations. We chose to leave the data points in as they do not seem to influence the overall shape of the distribution that much. Additionally we do not want to describe the data set perfectly, but merely want a predictive model.
As stated, in this analysis the main focus will lie on the task of finding a suitable predictive model for the waiting time until the next eruption. Before 1998 the waiting time was predicted using a linear model only taking the last eruption duration into account. According to Cook \& Weisberg \cite{pred} this function is given by
\begin{align}
\text{(next waiting time [min])} = 30 + 10 \cdot \text{(duration [min])}.
\label{eq:opred}
\end{align}
This graph can be found in Figure~\ref{fig:cluster} (left, violet line) and we can use it to determine the requirements for our predictive model. The central one is, that we do not so much want to describe the data set perfectly, but instead focus on minimzing the number of missed eruptions while on the same time trying to have a bearable additional waiting time if our model predicts \enquote{too short}.
To achieve this goal we will analyse the shape of the data and propose a proper framework for our model to translate our requirements into statistics.
%!Rnw root = Geyser-Analysis.Rnw
%!TeX root = Geyser-Analysis.Rnw
\section{Cluster detection}
\label{sec:data}
We have two variables, the (last) waiting time and the duration, which we want to use to predict the next waiting time. Figure~\ref{fig:cluster} shows that our data points cluster into three clouds. These clouds are color coded. We see that there is no obvious linear dependence between any of the two plotted relations. The official predictive function just seems to run parallel to the line through the two centers of duration plot clouds.
\vspace{-0.8cm}
\begin{figure}[htbp]
\setkeys{Gin}{width=0.45\textwidth}
\centering
\includegraphics{Geyser-Analysis-dur_opred_plot}
\hspace{1cm}
\includegraphics{Geyser-Analysis-wait_plot}
\caption{Color coded marginal scatter plots with separators and park model.}
\label{fig:cluster}
\end{figure}
We separate the three clouds by the two vertical lines depicted in Figure~\ref{fig:cluster}. The duration separator is at 3.1 and the wating time separator is at 67. This is not a perfect seperation as some points of the blue grouping mix with the green grouping and vice versa. Also the exact location of the waiting time split is up to some arbitration. It turns out that this is not a problem, because only one model (and not the best one) uses this separation. There are no other clusters in this data set as can be seen in Figure~\ref{fig:density} in Appendix~\ref{sec:details} on page~\pageref{fig:density}.
Azzalini \& Bowman \cite{data} state a rough physical model for the evolution of the waiting time, which involves a periodic process with period 2. It does not seem likely to us that we have such a simple periodicity. The details can be found in Appendix~\ref{sec:recurrence}.
%!Rnw root = Geyser-Analysis.Rnw
%!TeX root = Geyser-Analysis.Rnw
% This file contains the fits of the four chosen models.
% 0.) Fit waiting distribution
% Fit a mixed normal distribution to the distribution of the
% waiting times (no regard to duration)
% Fit a mixed normal distribution to the above distribution. The weight
% now depends logistically on the duration.
% 1.) Fit Logistic interpolation
% 2.) Fit PLM (dur)
% 3.) Fit LM (dur+wait)
% 4.) Fit PLM (dur+wait)
% 5.) Compute Predictive Error (Cross Validation)
% !!! Evaluation is normally turned of, since it takes quite long. The
% !!! values are buffered (manually) in the variables in setup.Rnw
% !!! Last one alone took 45 min to compile.
%!Rnw root = Geyser-Analysis.Rnw
%!TeX root = Geyser-Analysis.Rnw
\section{Model Framework}
We had a look at four different models and compared their predictions to the official predictive model. For that we set up our own loss function as depicted in Figure~\ref{fig:loss} on page~\pageref{fig:loss}. The procedure to determine its parameter can be found in Appendix~\ref{sec:loss}. The four investigated models are:
\vspace{-0.5ex}
\begin{enumerate}
\setlength{\itemsep}{-0.5ex}
\item A logistical interpolation of the two clouds shown in Figure~\ref{fig:cluster} (left).
\item A piecewise linear regression separating the two duration clouds.
\item A linear regression taking the duration as well as the last waiting time into account.
\item A piecewise linear regression separating the three clouds in Figure~\ref{fig:cluster} taking the duration and the last waiting time into account.
\end{enumerate}
The exact implementations of the fitting algorithm and the models is described in Appendix~\ref{sec:implementation} and~\ref{sec:models}. We also used cross validation to determine the predcitive total loss of our models (c.f.\ Appendix~\ref{sec:implementation}). The results can be found in Table~\ref{tab:model_summary}. The residuals of each of the models together with their fitted parameters are stated in Appendix~\ref{sec:models}.
\begin{table}[htbp]
\centering
\sisetup{
table-number-alignment = right,
table-figures-integer = 4,
table-figures-decimal = 2
}
\begin{tabular}
{r
S
S[table-auto-round]
S[table-auto-round]
S[table-auto-round]
S[table-auto-round]}
\toprule
& {Number of}& {Fitted} & {Predictive} & {Average} & {Average}\\
& {Param} & {Total Loss} & {Total Loss} & {Misses} & {Additional Wait}\\
& & & & {[\%]} & {[\si{\minute}]}\\
\midrule
Official & 2&4053.83& 4053.83 & 10.07&8.74564675746269\\
\midrule
Logistic Interpolation & 2
& 3445.19879915965 & 3488.03
& 2.3489932885906 & 10.4680371105143\\
Pw Linear (dur) & 4
& 3431.46347406616 & 3548.98 & 3.02013422818792
& 10.0984895296407\\
Linear (dur+wait) & 3
& 3533.46797628286 & 3873.74
& 1.67785234899329 & 11.0869214207606\\
Pw Linear (dur+wait) & 9
& 3184.26955186659 & 3354.9
& 3.35570469798658 & 9.07732483287011\\
\bottomrule
\end{tabular}
\caption{Summary of the models}
\label{tab:model_summary}
\end{table}
\section{Conclusion}
In conclusion we see that differentiating between the different clusters strongly improves the models. Model 3 (the only one not using clusters) can be discarded immediately. Its predictive total loss is by far the worst, because of the lack of used structure. The last model has the best predictive loss and smallest average waiting time, however the average miss ratio is slightly worse than for the other two. Together with the high amount of parameters necessary I would advise against its use. Perhaps in a later analysis one could consider a piecewise linear model with duration and waiting time as predictors, but only taking the duration separation into account. This would reduce the number of parameters to 6. As a standard choice I would recommend Model 1 as it has the best predictive behaviour, if the additional waiting time is of grave importance one could consider switching to Model 2, although the improvement is nigh negligible.
\newpage
\appendix
%!Rnw root = Geyser-Analysis.Rnw
%!TeX root = Geyser-Analysis.Rnw
\section{Descriptive Statistics}
\label{sec:details}
\subsection{Extended summary}
\label{sec:summary}
In the following section we describe the geyser data set in more detail. In addtion to what was mentioned in Section~\ref{sec:introduction}, we want to add that we have to keep in mind that waiting time $i$ of the data set denotes the waiting time \emph{before} eruption $i$ occured. Thus waiting time $i$ incorporates duration $i-1$. We could separate the two. However, as can be seen in Table~\ref{tab:summary}, the duration is much smaller than the waiting time such that the effect on the prediction would be negligible.
\begin{table}[htbp]
\centering
\sisetup{
table-number-alignment = right,
table-figures-integer = 3,
table-figures-decimal = 2
}
\begin{tabular}{r
S[table-auto-round]
S[table-auto-round]}
\toprule
& {waiting} & {duration}\\
& {[\si{\minute}]} & {[\si{\minute}]}\\
\midrule
Min & 43 & 0.8333333\\
1st Quantile & 59 & 2\\
Median & 76 & 4\\
Mean & 72.314381270903 & 3.46081382508361\\
3rd Quantile & 83 & 4.3833333\\
Max & 108 & 5.45\\
\bottomrule
\end{tabular}
\caption{Summary of the geyser data set}
\label{tab:summary}
\end{table}
In Figure~\ref{fig:density} we see a 3 dimensional scatter plot of the next waiting time over waiting time and duration. A density estimation has been added and we see, that there are three clouds visible. These correspond exactly to the coloring in Figure~\ref{fig:cluster}.
\begin{figure}[htbp]
\centering
\includegraphics[width=0.65\textwidth]{density.png}
\caption{Three dimensional scatter plot with estimated density. We find three clusters corresponding to the clusters marked in Figure~\ref{fig:cluster}}
\label{fig:density}
\end{figure}
%!Rnw root = Geyser-Analysis.Rnw
%!TeX root = Geyser-Analysis.Rnw
\subsection{Periodicity and Recurrence Plots}
\label{sec:recurrence}
As Azzalini \& Bowman state in \cite{data}, they suspect a rough physical model underlying the eruption pattern of the geyser. One prediction of this model were that the waiting time would have a period of 2, i.e.\ the waiting time would alternate between \enquote{long} and \enquote{short}. Figure~\ref{fig:ts} (left) seems to back this idea.
\begin{figure}[htbp]
\setkeys{Gin}{width=0.45\textwidth}
\centering
\includegraphics{Geyser-Analysis-plot_wait_period}
\hspace{1cm}
\includegraphics{Geyser-Analysis-rec_waiting_plot}
\caption{Left: Extract of the waiting time time series. Right: Extract of the waiting time recurrence plot.}
\label{fig:ts}
\end{figure}
To investigate the periodicity of our data set further, we used a recurrence plot as introduced in \cite{recurrence}. Recurrence plots give us a mark at coordinate $(i,j)$ whenever the waiting time at point $i$ is \emph{sufficiently} close to that of $j$. In our case we compared three consecutive data points starting at $i$ with their counterpart at $j$ and wanted them to be closer together than one standard deviation (13.89), which is a rather lenient restriction.
We see in Figure~\ref{fig:ts} (right), that we have basically no recurrences at $(i,i+2)$, which we would expect, if we had a periodicity of two. There are quite a few other recurrences, which is not surprising as our closeness constraint is rather lax, however it seems that all in all the period is either too long or too complex to be taken into account in the further analysis.
\begin{figure}[htbp]
\setkeys{Gin}{width=0.45\textwidth}
\centering
\includegraphics{Geyser-Analysis-plot_wait_period2}
\hspace{1cm}
\includegraphics{Geyser-Analysis-rec_waiting_plot2}
\caption{Left: Full waiting time time series. Right: Full waiting time recurrence plot.}
\label{fig:ts2}
\end{figure}
\subsection{Nonparametric and parametric density estimation}
\label{sec:density}
Since we want to predict the waiting time it is interesting how the waiting time is distributed overall. For that we utilised a density estimation. We used the \texttt{density} method of R together with the Sheather-Jones-Algorithm(c.f.~\cite{sj}) for Bandwidth gauging. The problem is that we wanted to estimate the overall shape without averaging out important structure. To take this into account we also had a look at estimated densities with a far smaller bandwidth and found that the estimated bandwidth is appropriate for our purpose. The result of the estimation can be found in Figure~\ref{fig:wait} (left, black curve).
\begin{figure}[htbp]
\setkeys{Gin}{width=0.45\textwidth}
\centering
\includegraphics{Geyser-Analysis-dens_estim}
\hspace{1cm}
\includegraphics{Geyser-Analysis-plot_mn_density2}
\caption{Density estimation of the Waiting time. Left: Duration independet. Right: Duration dependent.}
\label{fig:wait}
\end{figure}
We observe that the distribution is strongly bimodal and from the form we deduce that it should be well described by a mixed normal distribution. We did this approximation (blue curve in Figure~\ref{fig:wait} left) following the procedure described in \cite{density}. Our density candidate is given by
\begin{Schunk}
\begin{Sinput}
> dwait1 <- function(x,p) {
+ p[1]*dnorm(x, mean = p[2], sd = p[3]) +
+ (1-p[1])*dnorm(x, mean = p[4], sd = p[5])
+ }
> dwait1.min <- function(x, p) {
+ e <- dwait1(x,p)
+ if(any(e <= 0)) Inf else -sum(log(e))
+ }
\end{Sinput}
\end{Schunk}
We minimize \texttt{dwait1.min} using the built in minimizer \texttt{optim} of R. As we see in Figure~\ref{fig:wait} the blue curve is in rather good agreement with the estimated density. However, the left peak is a little off and the overall form of it differs from the nonparametric density.
To remedy this shortcoming we want to introduce (as described in \cite{density}) another parameter and take the duration into account. Looking at Figure~\ref{fig:cluster}, it is clear that the duration has a strong impact on the distribution of the waiting time, so it seems sensible to make the density duration dependent. For this we make the \enquote{mixing coefficent} duration dependent. Since this coefficient has to vary between 0 and 1 it is natural to assume a logistic relation to the duration. Implemented this takes the following form:
\begin{Schunk}
\begin{Sinput}
> dwait2 <- function(x,y,p) {
+ q <- exp(p[1] + p[2]*y)
+ q <- q/(1+q)
+ q*dnorm(x, mean = p[3], sd = p[4]) +
+ (1-q)*dnorm(x, mean= p[5], sd = p[6])
+ }
> dwait2.min <- function(x,y,p){
+ e <- dwait2(x,y,p)
+ if(any(e <= 0)) Inf else -sum(log(e))
+ }
\end{Sinput}
\end{Schunk}
Analogously to \texttt{dwait1} we minimized it using \texttt{optim}. We see that for the one additional parameter we get a minimum of 985.32 as compared to 1154.24 for \texttt{dwait1}. This is a remarkable improvement. In Figure~\ref{fig:wait} (right) the waiting time distribution is plotted at different durations. We see that if the duration time is either very high or very low, the bimodal distribution degenerates to a common normal distribution.
Since we now have this very managable duration dependent distribution, we can easily obtain a predictive model for the waiting time. We simply compute the expectation value of the distribution, which depends logistically on the duration. This is the very first predictive model we use as can be seen in Appendix~\ref{sec:models}.
%!Rnw root = Geyser-Analysis.Rnw
%!TeX root = Geyser-Analysis.Rnw
\section{The loss function}
\label{sec:loss}
In order to achieve our goal to find a suitable model to predict the waiting time of the faithful geyser, we need to work with an appropriate loss function.
The standard loss function (i.e.\ minimizing over the quadratic error) does not seem to be a permissible candidate. A good model should take into account that it is far less annoying for a visitor to wait a couple of minutes at the geyser before the eruption starts than being a few minutes too late. Because of the symmetry of the quadratic error, it cannot distinguish between 'being too early' and 'being late'.
Thus we will use the following loss function
\begin{Schunk}
\begin{Sinput}
> loss <- function(x, q){
+ x[x < 0] <- q
+ x
+ }
\end{Sinput}
\end{Schunk}
with parameter $q$.
As can be be seen in Figure~\ref{fig:loss}, the function attributes a constant high loss, if the prediction is too long and a linear gain of loss, if the predicted waiting time is shorter than the actual waiting time. This fullfills the requirement that a missed eruption weighs heavier than additional waiting time.
\begin{figure}[htbp]
\setkeys{Gin}{width=0.45\textwidth}
\centering
\includegraphics{Geyser-Analysis-plot_loss}
\caption{Proposed loss function. Positive time corresponds to a shorter predicted waiting time than the actual waiting time.}
\label{fig:loss}
\end{figure}
The last necessary step is to determine the parameter of this function. At first glance it seems that there are two: The slope of the linear part and the intercept of the constant part. However scaling the function by a (postive) factor does not change the result of a minimization. So we are free to fix the slope to 1 and are left with only the intercept as parameter. In order to determine it, we use the official predictive function of the National Park (as mentioned in~\cite{pred}, c.f.\ Equation~\eqref{eq:opred} on page~\pageref{eq:opred}).
The procedure to determine our parameter $q$ now consists in the following idea: If we have the right $q$ and we fit a linear model using this loss function we want to get the official park function. This idea is sensible, as we can assume that there certainly was a bit of optimization (and rounding) involved in the determination of the park function and we thus can make sure that we roughly apply the same error margin as the park.
Therefore the (manually) applied algorithm consists in systematically choosing different values for $q$ and then fitting a linear model. Aftewards we check the intercept and slope. If both are \enquote{close enough} to 30 respectively 10, we found our parameter.
The fitting procedure used is described in Appendix~\ref{sec:implementation}. To make sure that we end up in the right local minimum, we choose 30 and 10 as the inital parameters for the fit of the linear model.
For $q = 57$ we yield fit parameters 29.95 and 9.02. We also see that the models are rather robust with respect to small changes in $q$. So $q = 57$ seems an acceptable choice. $q$ lies in a sensible region, too. A short sanity check: With that choice it is as bad to miss an eruption (and thus having to wait another average waiting time) as it is bad to have to wait another average waiting time (c.f.\ Table~\ref{tab:summary} on page~\pageref{tab:summary}) to the next eruption after the predictive model said it should be. This sounds reasonable.
With this we get the total loss of the predictive park function as 4053.83. This is from now on our margin to top for alternative models.
%!Rnw root = Geyser-Analysis.Rnw
%!TeX root = Geyser-Analysis.Rnw
\section{The fit and cross validation routine}
\label{sec:implementation}
If we are given predictors $x \coloneqq (x_i)_{i=1,\dots,n}$ and regressors $ y \coloneqq (y_i)_{i=1,\dots,n}$ and a model function $f_p$ with parameter(s) $p$. We want to choose the parameter(s) in a manner such that the overall loss of the prediction is minimal. That means we want to solve the minimisation problem
\[
\min_p g(p, x, y) \coloneqq \min_p \sum_{i=1}^n \operatorname{loss}(y_i - f_p(x_i)).
\]
In R this problem takes the following form: The method \texttt{genmin} takes the function $f_p$ and generates the function $g$ we want to minimize over. The method \texttt{lossfit} takes $f_p$ internally generates $g$ and then uses the R intern minimizer \texttt{optim} with the Nelder-Mead-Algorithm (c.f.~\cite{nm}) and initial parameter $p_0$ to estimate the parameter $p$. In code this looks as follows:
\begin{Schunk}
\begin{Sinput}
> # takes a function fn(x,p) and generates a function to minimize over.
> genmin <- function(fn, l = loss, bound = q) {
+ function(p,x,y) {
+ result <- 0
+ if(is.atomic(x)) {
+ for(i in 1:length(y)) {
+ result <- result + l(y[i] - fn(x[i], p), bound)
+ }
+ } else {
+ for(i in 1:length(y)) {
+ result <- result + l(y[i] - fn(x[i,], p), bound)
+ }
+ }
+ #sum(loss(y - fn(x,p), bound))
+ result
+ }
+ }
> # fitting/minimization taking place here
> # p0: initial parameter
> # x: all predictors (as data.frame)
> # y: the (1d) regressor
> # l: custom loss function
> # bound: bound of the loss function
> lossfit <- function(p0, fn, x, y, l = loss, bound = q){
+ fn.min <- genmin(fn, l = l, bound = bound)
+ optim(p0, fn.min, x = x, y = y)
+ }
\end{Sinput}
\end{Schunk}
As any numerical optimizer/minimizer, \texttt{optim} only finds local minima and therefore $p$ critically depends on the chosen $p_0$. The method applied to finding $p_0$ is given for each model separately in Appendix~\ref{sec:models}.
The cross validation routine iterates over the regressor predictor pairs, always leaving one out and computing the fit completely analogously to the fit method. Afterwards it uses the parameter to compute the loss of the left out data point. These losses are added up and returned. The code is as follows:
\begin{Schunk}
\begin{Sinput}
> # p0: initial parameter
> # x: all predictors (as data.frame)
> # y: the (1d) regressor
> # l: custom loss function
> # bound: bound of the loss function
> losscv <- function(p0, fn, x, y, l = loss, bound = q) {
+ fn.min <- genmin(fn, l = l, bound = bound)
+ result <- list(value = 0, warn = c())
+ for(i in 1:length(y)) {
+ o <- optim(p0, fn.min, x = x[-i,], y = y[-i])
+ if (o$convergence != 0) {
+ result$warn <- append(result$warn, values = i)
+ }
+ result$value <- result$value + fn.min(o$par, x[i,], y[i])
+ }
+ result
+ }
\end{Sinput}
\end{Schunk}
Again we have the problem of having to choose an initial parameter $p_0$. Therefore we reused the initial parameters of the fit routine.
%!Rnw root = Geyser-Analysis.Rnw
%!TeX root = Geyser-Analysis.Rnw
\section{Model implementation and residues}
\label{sec:models}
In the following we show the implementation of each of the used models, the residues after the fit, the strategy for obtaining the initial parameters and the values of the estimated parameters.
\subsection{Logistic interpolation with regressor: duration}
This model is just a standard implementation of a logistic function interpolating between two plateaus. The parameters of the exponential function, which determine the transition region and speed are fixed and taken from the fit of the duration dependent waiting time distribution (c.f.\ Appendix~\ref{sec:density}). Thus the implementation takes the form:
\begin{Schunk}
\begin{Soutput}
A single object matching ‘lg1.wait’ was found
It was found in the following places
.GlobalEnv
with value
function(x,p) {
if(is.atomic(x)) {
d <- x
} else {
d <- x$d
}
q <- exp(l[1] + l[2]*d)
q <- q/(1+q)
q*p[1] + (1-q)*p[2]
}
<environment: 0x26191a8>
\end{Soutput}
\end{Schunk}
As initial parameters for the plateaus we chose the mean of the two peaks in the waiting time distribution (again, c.f.\ Appendix~\ref{sec:details}). The result is shown in Table~\ref{tab:log1}. The residue of the fit is depicted in Figure~\ref{fig:residue1} (left) on page~\pageref{fig:residue1}. All in all there is not much structure in the residues except for a few outliers which are easily identified with the outliers in the waiting time distribution.
\begin{table}[htbp]
\centering
\sisetup{
table-number-alignment = right,
table-figures-integer = 2,
table-figures-decimal = 2
}
\begin{tabular}{r
S[table-auto-round]
S[table-auto-round]}
\toprule
& {Param1} & {Param2}\\
\midrule
Initial & 55.14 & 81.09\\
Terminal & 44.85 & 71.01\\
\bottomrule
\end{tabular}
\caption{Parameters of the logistic interpolation model}
\label{tab:log1}
\end{table}
\subsection{Piecewise linear model with regressor: duration}
This model is very close to the official predictive model. The only difference is, that it respects the two clusters in the wating time duration plot. Thus we get two linear models, one for each cluster. Implemented this looks as follows:
\begin{Schunk}
\begin{Sinput}
> plm1.dur <- function(x, p) {
+ if(is.atomic(x)) {
+ d <- x
+ } else {
+ d <- x$d
+ }
+ result <- c()
+ for(i in 1:length(d)) {
+ if(x[i] < 3.1) {
+ result <- append(x = result, values = p[1] + p[2]*d[i])
+ } else {
+ result <- append(x = result, values = p[3] + p[4]*d[i])
+ }
+ }
+ result
+ }
\end{Sinput}
\end{Schunk}
As the model is adapted from the official model. We use the parameters of it as initial parameters for our fit. The result of our regression can be found in Table~\ref{tab:plm1} and the residues are in Figure~\ref{fig:residue1} (right). Again we see almost no structure, except for the same outliers as in the model before.
\begin{table}[htbp]
\centering
\sisetup{
table-number-alignment = right,
table-figures-integer = 2,
table-figures-decimal = 2
}
\begin{tabular}{r
S[table-auto-round]
S[table-auto-round]
S[table-auto-round]
S[table-auto-round]}
\toprule
& {Param1} & {Param2} & {Param3} & {Param4}\\
\midrule
Initial & 30 & 10
& 30 & 10\\
Terminal & 29.55 & 9
& 42.33 & 6.67\\
\bottomrule
\end{tabular}
\caption{Parameters of the piecwise linear (dur) model}
\label{tab:plm1}
\end{table}
\begin{figure}[htbp]
\setkeys{Gin}{width=0.45\textwidth}
\centering
\includegraphics{Geyser-Analysis-log1_plot}
\hspace{1cm}
\includegraphics{Geyser-Analysis-plm1_plot}
\caption{Residuals of Model 1 (left) and Model 2 (right)}
\label{fig:residue1}
\end{figure}
\subsection{Linear model with predictors: duration \& waiting time}
This model is a slight adaption of the official predictive model, too. Instead of taking the clustering into account we add the last waiting time as a regressor. This leads to the following implementation:
\begin{Schunk}
\begin{Sinput}
> lm1.durwait <- function(x,p) {
+ d <- x$d
+ w <- x$w
+ p[1] + p[2]*d + p[3]*w
+ }
\end{Sinput}
\end{Schunk}
Because of the close relation to the park model, we use its parameters as inital parameters. For the waiting time coefficient we use 0 as we suspect it will only act as a small correction. The results can be found in Table~\ref{tab:lm1} and the residues are depicted in Figure~\ref{fig:residue2} on page~\pageref{fig:residue2}. As before, there is almost no structure in the residues, except for the known outliers.
\begin{table}[htbp]
\centering
\sisetup{
table-number-alignment = right,
table-format = -2.2
}
\begin{tabular}{r
S[table-auto-round]
S[table-auto-round]
S[table-auto-round]}
\toprule
& {Param1} & {Param2} & {Param3} \\
\midrule
Initial & 30 & 10
& 0\\
Terminal & 30.49 & 10.02
& -0.05\\
\bottomrule
\end{tabular}
\caption{Parameters of the linear (dur+wait) model}
\label{tab:lm1}
\end{table}
\subsection{Piecewise linear model with predictors: duration \& waiting time}
This model is a refinement of the previous one. Again we use duration and waiting time as predictors, yet this time we apply a linear model to each of the three identified clusters separately. This increases the number of parameters to 9 and the code takes the following form:
\begin{Schunk}
\begin{Sinput}
> plm2.durwait <- function(x,p) {
+ d <- x$d
+ w <- x$w
+ result <- c()
+ for(i in 1:length(d)) {
+ r <- 0
+ if(d[i] <= dsep) {
+ r <- p[1] + p[2]*d + p[3]*w
+ } else {
+ if(w[i] <= wsep) {
+ r <- p[4] + p[5]*d + p[6]*w
+ } else {
+ r <- p[7] + p[8]*d + p[9]*w
+ }
+ }
+ result <- append(x = result, values = r)
+ }
+ result
+ }
\end{Sinput}
\end{Schunk}
Because of the close relation to the park model. We use its parameters as inital parameters. For the different waiting times we use 0, as we suspect only a small corrective effect. The result is summarized in Table~\ref{tab:plm2} and the residues can be found in Figure~\ref{fig:residue2} (right). In this case there is a little more structure than in the other cases and we have the same outliers. Additionally we see that we miss more eruptions than in the other models. Together with the high number of necessary parameters, these are indicators against the use of this model.
\begin{table}[htbp]
\centering
\sisetup{
table-number-alignment = right,
table-format = -2.2
}
\begin{tabular}{r
S[table-auto-round]
S[table-auto-round]
S[table-auto-round]
S[table-auto-round]}
\toprule
& {Param1} & {Param2} & {Param3} & {Param4}\\
\midrule
Initial & 30 & 10
& 0 & 30\\
Terminal & 29.98 & 10.11
& -0.03 & 32.68\\
\bottomrule
\end{tabular}
\vphantom{h}
\vspace{0.3cm}
\vphantom{h}
\begin{tabular}{S[table-auto-round]
S[table-auto-round]
S[table-auto-round]
S[table-auto-round]
S[table-auto-round]}
\toprule
{Param5} & {Param6} & {Param7} & {Param8} & {Param9}\\
\midrule
10 & 0
& 30 & 10
& 0\\
10.11 & -0.01
& 30.01 & 9.88
& -0.04\\
\bottomrule
\end{tabular}
\caption{Parameters of the piecwise linear (dur+wait) model}
\label{tab:plm2}
\end{table}
\begin{figure}[htbp]
\setkeys{Gin}{width=0.45\textwidth}
\centering
\includegraphics{Geyser-Analysis-lm1_plot}
\hspace{1cm}
\includegraphics{Geyser-Analysis-plm2_plot}
\caption{Residuals of Model 3 (left) and Model 4 (right)}
\label{fig:residue2}
\end{figure}
\printbibliography[heading=bibintoc]
\end{document}