-
Notifications
You must be signed in to change notification settings - Fork 0
/
README
245 lines (182 loc) · 9.07 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
Emulator
C.Coleman-Smith, [email protected],
H.Canary, <cs.unc.edu/~hal>
Gaussian Process regression in C with an R library too
Brief:
A project to implement scalar Gaussian process regression with a
variety of covariance functions with an optional (up to 3rd order)
linear regression prior mean. Maximum likelihood hyper-parameters can
be estimated to parametrize a given set of training data. For a given
model (estimated hyperparams and training data set) the posterior mean
and variance can be sampled at arbitrary locations in the parameter
space.
The model parameter space can take any dimension (within limits of the
optimizer), output must be scalar.
Primary user interface is through the interactive_emulator program
~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=
Depends:
cmake (http://www.cmake.org/cmake/resources/software.html)
gsl
R (http://www.r-project.org/) (for using the R lib)
~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=
Compiling:
./build.sh
On OS-X you need to set the following environment variables:
export DYLD_LIBRARY_PATH=$DYLD_LIBRARY_PATH:"~/local/lib"
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:"~/local/lib"
Otherwise R will complain when you try and load libRbind.dylib
~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=
Interactive Emulator:
useage:
interactive_emulator estimate_thetas INPUT_MODEL_FILE MODEL_SNAPSHOT_FILE [OPTIONS]
or
interactive_emulator interactive_mode MODEL_SNAPSHOT_FILE
or
interactive_emulator print_thetas MODEL_SNAPSHOT_FILE
estimate_thetas -> train an emulator on a given set of model
outputs, all the information needed to use the emulator is
written to MODEL_SNAPSHOT_FILE
interactive_mode -> sample the emulator contained in
MODEL_SNAPSHOT_FILE at a set of points given on STDIN, the
emulated mean and variance are output to STDOUT
print_thetas -> write out the covariance length scales for the
emulator given by MODEL_SNAPSHOT_FILE
INPUT_MODEL_FILE can be "-" to read from standard input.
The input MODEL_SNAPSHOT_FILE for interactive_mode must match the
format of the output MODEL_SNAPSHOT_FILE from estimate_thetas.
Options for estimate_thetas can include:
--regression_order=0
--regression_order=1
--regression_order=2
--regression_order=3
--covariance_fn=0 (POWER_EXPONENTIAL)
--covariance_fn=1 (MATERN32)
--covariance_fn=2 (MATERN52)
(-v=FRAC) --pca_variance=FRAC : sets the pca decomp to keep cpts up to variance fraction frac
options which incluence interactive_mode:
(-q) --quiet: run without any extraneous output
(-z) --pca_output: emulator output is left in the pca space
general options:
-h -? print the help dialogue
The defaults are regression_order=0 and covariance_fn=POWER_EXPONENTIAL.
These options will be saved in MODEL_SNAPSHOT_FILE.
INPUT_MODEL_FILE (With multi-output) FORMAT:
Models with multivariate output values y = y_1...y_t which we will
think of as rows of a matrix, in the following spec
number_outputs = t
BEGIN EXAMPLE
number_outputs
number_params
number_model_points
X[0,0]
...
X[0,number_params-1]
X[1,0]
...
X[number_model_points-1,number_params-1]
Y[0, 0]
Y[0, 1]
...
Y[0, number_outputs-1]
...
Y[1, 1]
...
Y[number_model_points-1, number_outputs-1]
END EXAMPLE
number_outputs, number_params and number_model_points should be
positive integers. X[i,j] and Y[i,j] will be read as
double-precision floats.
~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=
Example Analysis Using Interactive Emulator:
Several example analysis setups can be found in the test directory:
./test/uni-simple -> a test model with a single parameter and a 1 dimensional output
./test/uni-2d-param -> a test model with a 2 dimensional parameter
space and a 1 dimensional output
./test/multi-simple -> a test model with a 3 dimensional parameter
space and a 5 dimensional output
Each example in test has a script train-emulator.sh which will build
an emulator based on that particular model, the emulator can then be
sampled using sample-emulator.sh and figures can be built using the
enclosed R files.
~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=
Example Usage (R):
The file examples/simple-regression.R generates a set of model data
for a 1d model and then trains an emulator on this data. Build and
install the code base, fire up an R shell and load the file with
source("simple-regression.R").
If everything works OK you should see some output in the R terminal
and a plot of the emulator predictions (red) against the true model
(black) and the training points (green triangles). The output should
look somethign like examples/r-simple-plot.pdf.
It is interesting to try changing the number of model samples (the
argument to make.model) and the cov.fn and reg.order used in
estimate.model.
The file 2d-regression.R builds an emulator for a 2d model, makes
predictions over a grid and plots the predicted mean, variance, the
original model and the variance weighted squared difference:
D(u*) = sqrt{ (m(u*) - yM(u*))**2 / cov(u*) }
This is fun to play with.
~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=
Emulator Theory Summary:
Gaussian Process regression is a supervised learning method, a
representative set of samples of a model's output can be used to build
a predictive distribution for the output at untried locations in the
models parameter space. This code base has been developed to apply
this process to the output of complex computer codes.
Suppose we have a model Y_m(u) which produces scalar output as a
function of some set of input parameters u. A design is a set of
points D = {u_1, ..., u_n} in the u space at which we will evaluate
the model. A set of training outputs Y_t = {Y_m(u_1), ... Y_m(u_n)} is
produced by evaluating the model at each location in D. The sets D and
Y_t are then sufficient to 'train' a GP regression model or Emulator
to produce predictions at some new locations u*.
The training process conditions a Gaussian Process prior on the set
{Y_t, D} such that functions drawn from the ensuing posterior will
always pass through the training locations. A Gaussian Process (GP) is
a stochastic process for which any finite set of samples have a multi
variate normal distribution. In essence we force the Gaussian Process
to always generate functions which interpolate our data set.
The GP is specified in terms of a prior mean, usually chosen to be
zero or given by a simple regression model of the training data, and a
covariance function over the parameter space. The covariance function
specifies the correlation between pairs of parameter values u_i,
u_j. A power exponential form, with a nugget, is the default.
C(u_i, u_j) = theta_0 * exp(-1/2 ( u_i - u_j) ^2 / theta_l^2) + theta_1.
The covariance function itself is controlled by a set of
hyper-parameters theta_i, these act as characteristic length scales in
the parameter space and are a-priori unknown. The training process
consists of estimating the 'best' set of length scales for the given
model data set. This estimation process is done through numerical
maximization of the posterior hyper parameter likelihood, a fully
Bayesian treatment is also possible.
Once the GP has been conditioned, and a best set of hyper parameters
Theta has been obtained, we have a posterior predictive distribution
for the model output at a new location u*:
P(Y_m(u*) | Y_t, D, Theta) = MVN(m(u*), cov(u*)),
where the posterior mean and covariance are (for a zero prior mean)
m(u*) = C(u*)_(D)^t C(D,D)^{-1} Y_t
cov(u*) = C(u*,u*) - C(u*)_(D)^t C(D,D)^{-1} C(u*)_(D).
Here C(D,D) represents the matrix formed by evaluating the covariance
function C(u_i, u_j) at each of the locations in D, the vector
C(u*)_(D) = {C(u*, u_1), C(u*, u_2),...C(u*, u_n)} is the correlation
between each location in the design set and the new point.
These equations have a relatively straightforward interpretation. The
prior mean (0) is modified by the covariance structure deduced from
the data C(u*)_(Y_t)^t C(D,D)^{-1} Y_t, at u* = u_i where u_i is in D,
we can see that m(u_i) = 0. The prior covariance at u*, C(u*,u*) is
somewhat reduced by the training set C(u*)_(Y_t)^t C(D,D)^{-1}
C(u*)_(Y_t) and again at u*=u_i for u_i in D we reduce to cov(u_i) =
0. As such we have constructed an interpolating function.
It is important to note that unlike some other data
interpolation/extrapolation methods we end up with a probability
distribution for the value of our model at an untried location, as it
is normal the mean and covariance are sufficient to describe the
distribution. These can be obtained by the emulate_at_point or
emulate_at_list functions.
It is straightforward to draw samples from the predictive
distribution, although care must be taken to expand the above
equations correctly for a set of m observation locations
U* = {u*_1, ... u*_m}.
For more information see the documentation, the website
http://www.gaussianprocess.org and the invaluable book "Gaussian
processes for machine learning".