# Advanced Models
## Genetic Programming for Symbolic Regression
This technique is inspired by Darwin's theory of evolution.
+ 1960s: evolution strategies are introduced by I. Rechenberg in his work "Evolution Strategies"
+ 1975: Genetic Algorithms (GAs) are invented by J. Holland and published in his book "Adaptation in Natural and Artificial Systems"
+ 1992: J. Koza uses genetic algorithms to evolve programs that perform certain tasks, and calls his method "genetic programming" (GP)
Another reference for GP: Langdon WB, Poli R (2002) Foundations of Genetic Programming. Springer.

- Depending on the function set used and the function to be minimised, GP can generate almost any type of curve



In R, we can use the "rgp" package. Note that rgp has since been archived on CRAN, so it may need to be installed from the CRAN archive; the following chunks are therefore not evaluated.
## Genetic Programming Example
### Load Data
```{r loadData, eval = FALSE}
library(foreign)
# Read the Telecom1 data (CSV)
telecom1 <- read.table("./datasets/effortEstimation/Telecom1.csv", sep = ",",
                       header = TRUE, stringsAsFactors = FALSE, dec = ".")
size_telecom1 <- telecom1$size
effort_telecom1 <- telecom1$effort
# Read the China data (ARFF), already split into training and test sets
chinaTrain <- read.arff("./datasets/effortEstimation/china3AttSelectedAFPTrain.arff")
china_train_size <- chinaTrain$AFP
china_train_effort <- chinaTrain$Effort
chinaTest <- read.arff("./datasets/effortEstimation/china3AttSelectedAFPTest.arff")
china_size_test <- chinaTest$AFP
actualEffort <- chinaTest$Effort
```
### Genetic Programming for Symbolic Regression: China dataset.
```{r geneticProgramingExample, message=FALSE, warning=FALSE, eval = FALSE}
library("rgp")
options(digits = 5)
stepsGenerations <- 100    # number of evolution steps (generations)
initialPopulation <- 100   # population size
y <- china_train_effort
x <- china_train_size
data2 <- data.frame(y, x)  # data frame with effort and size
newFuncSet <- mathFunctionSet
# Alternatives to mathFunctionSet:
# newFuncSet <- expLogFunctionSet          # "sqrt", "exp" and "ln"
# newFuncSet <- trigonometricFunctionSet
# newFuncSet <- arithmeticFunctionSet
# newFuncSet <- functionSet("+", "-", "*", "/", "sqrt", "log", "exp")
gpresult <- symbolicRegression(y ~ x,
                               data = data2, functionSet = newFuncSet,
                               populationSize = initialPopulation,
                               stopCondition = makeStepsStopCondition(stepsGenerations))
# Best and worst individuals of the final population according to the fitness function
bf <- gpresult$population[[which.min(sapply(gpresult$population, gpresult$fitnessFunction))]]
wf <- gpresult$population[[which.max(sapply(gpresult$population, gpresult$fitnessFunction))]]
bf1 <- gpresult$population[[which.min(gpresult$fitnessValues)]]
plot(x, y)
lines(x, bf(x), type = "l", col = "blue", lwd = 3)
lines(x, wf(x), type = "l", col = "red", lwd = 2)
# Evaluate the best individual on the test set
x_test <- china_size_test
estim_by_gp <- bf(x_test)
ae_gp <- abs(actualEffort - estim_by_gp)
mean(ae_gp)
```
### Genetic Programming for Symbolic Regression: Telecom1 dataset.
- For illustration purposes only; all data points are used (no training/test split).
```{r, eval = FALSE}
y <- effort_telecom1   # all data points
x <- size_telecom1
data2 <- data.frame(y, x)  # data frame with effort and size
newFuncSet <- expLogFunctionSet  # "sqrt", "exp" and "ln"
# Alternatives:
# newFuncSet <- mathFunctionSet
# newFuncSet <- trigonometricFunctionSet
# newFuncSet <- arithmeticFunctionSet
# newFuncSet <- functionSet("+", "-", "*", "/", "sqrt", "log", "exp")
gpresult <- symbolicRegression(y ~ x,
                               data = data2, functionSet = newFuncSet,
                               populationSize = initialPopulation,
                               stopCondition = makeStepsStopCondition(stepsGenerations))
bf <- gpresult$population[[which.min(sapply(gpresult$population, gpresult$fitnessFunction))]]
wf <- gpresult$population[[which.max(sapply(gpresult$population, gpresult$fitnessFunction))]]
bf1 <- gpresult$population[[which.min(gpresult$fitnessValues)]]
plot(x, y)
lines(x, bf(x), type = "l", col = "blue", lwd = 3)
lines(x, wf(x), type = "l", col = "red", lwd = 2)
```
## Neural Networks
A neural network (NN) simulates some of the learning functions of the human brain.
It can recognize patterns and "learn": through a trial-and-error process the system gradually becomes an "expert" in the field.
A NN is composed of a set of nodes (units, neurons, processing elements):
+ Each node has inputs and outputs
+ Each node performs a simple computation defined by its node function
Nodes are linked by weighted connections:
+ The connectivity gives the structure/architecture of the net
+ What a NN can compute is primarily determined by the connections and their weights


There are several packages in R to work with NNs, for example:
+ [neuralnet](https://cran.r-project.org/web/packages/neuralnet/index.html)
+ [nnet](https://cran.r-project.org/web/packages/nnet/index.html)
+ [RSNNS](https://cran.r-project.org/web/packages/RSNNS/index.html)
The following is an example with the neuralnet package. Neural networks need the variables to be scaled to work properly; the network is therefore trained on normalized data, and its predictions must be denormalized back to the original effort scale (a possible completion is sketched after the chunk).
```{r NeuralNetExample, message=FALSE, warning=FALSE}
library(foreign)
library(neuralnet)
chinaTrain <- read.arff("datasets/effortEstimation/china3AttSelectedAFPTrain.arff")
afpsize <- chinaTrain$AFP
effort_china <- chinaTrain$Effort
chinaTest <- read.arff("datasets/effortEstimation/china3AttSelectedAFPTest.arff")
AFPTest <- chinaTest$AFP
actualEffort <- chinaTest$Effort
trainingdata <- cbind(afpsize, effort_china)
colnames(trainingdata) <- c("Input", "Output")
testingdata <- cbind(AFPTest, actualEffort)
colnames(testingdata) <- c("Input", "Output")
# Normalize the data to [0, 1]
norm.fun <- function(x){(x - min(x)) / (max(x) - min(x))}
data.norm <- apply(trainingdata, 2, norm.fun)
# data.norm
# The test inputs must be normalized with the *training* minimum and maximum
testinput.norm <- (testingdata[, "Input"] - min(afpsize)) / (max(afpsize) - min(afpsize))
# Train the neural network with one hidden layer of 10 neurons.
# threshold is a numeric value specifying the threshold for the partial
# derivatives of the error function as stopping criterion.
# net_eff <- neuralnet(Output ~ Input, data.norm, hidden = 5, threshold = 0.25)
net_eff <- neuralnet(Output ~ Input, data.norm, hidden = 10, threshold = 0.01)
# Print the network
# print(net_eff)
# Plot the neural network
plot(net_eff)
# Predictions on the test set and their denormalization are sketched in the next chunk
```
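The chunk above stops after training and plotting the network. A possible completion is sketched below (not evaluated here): the normalized test inputs are passed through the network with `compute()`, and the predictions are mapped back to the original effort scale by inverting `norm.fun` with the training minimum and maximum of the effort. The names `estim_by_nn` and `ae_nn` are ours, chosen to mirror the genetic programming example.
```{r NeuralNetPredictionSketch, eval = FALSE}
# Run the normalized test inputs through the trained network
net.results <- compute(net_eff, data.frame(Input = testinput.norm))
# Denormalize: invert norm.fun with the minimum and maximum of the training effort
estim_by_nn <- net.results$net.result * (max(effort_china) - min(effort_china)) + min(effort_china)
# Compare the estimates with the actual effort of the test set
cleanoutput <- cbind(AFPTest, actualEffort, estim_by_nn)
colnames(cleanoutput) <- c("AFP", "Actual Effort", "Neural Net Estimate")
head(cleanoutput)
ae_nn <- abs(actualEffort - estim_by_nn)
mean(ae_nn)   # mean absolute error
```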
## Support Vector Machines
Support Vector Machines (SVMs) can be used both for classification and, as Support Vector Regression (SVR), for numeric prediction. In R they are available, for instance, in the e1071 and kernlab packages; a minimal sketch with e1071 follows.
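As an illustrative sketch only (not evaluated here, and not part of the original example), support vector regression can be applied to the China effort data loaded in the neural network example above; `svr_model` and `estim_by_svr` are our names. By default, `svm()` fits epsilon-regression with a radial kernel for a numeric response and scales the variables internally.
```{r svmSketch, eval = FALSE}
library(e1071)
# Epsilon-SVR with a radial (RBF) kernel, effort as a function of size (AFP)
svr_model <- svm(effort_china ~ afpsize)
# Estimate the effort of the test projects and compute the mean absolute error
estim_by_svr <- predict(svr_model, data.frame(afpsize = AFPTest))
ae_svr <- abs(actualEffort - estim_by_svr)
mean(ae_svr)
```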
## Ensembles
Ensembles, or meta-learners, combine multiple models to obtain better predictions, i.e., this technique consists of combining single classifiers (sometimes also called weak classifiers).
A problem with ensembles is that their models are difficult to interpret (they behave as black boxes) in comparison with decision trees or rules, which provide an explanation of their decision-making process.
Ensemble techniques are typically classified as Bagging, Boosting and Stacking (stacked generalization).
### Bagging
Bagging (also known as bootstrap aggregating) is an ensemble technique in which a base learner is applied to multiple datasets of equal size created from the original data by bootstrapping; predictions are obtained by voting over the individual predictions. An advantage of bagging is that it does not require any modification of the learning algorithm, and it exploits the instability of the base classifier to create diversity among the ensemble members, so that individual members perform well in different regions of the data. Bagging does not work well with classifiers whose output is robust to perturbations of the data, such as nearest-neighbour (NN) classifiers. A minimal sketch of the bootstrap-and-vote idea follows.
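This sketch is illustrative only: `bag_trees` and `predict_bagging` are our names rather than a package API, the base learner is an rpart tree, and the chunk is not evaluated; the ipred package offers a ready-made `bagging()` implementation.
```{r baggingSketch, eval = FALSE}
library(rpart)

# Grow n_models trees, each on a bootstrap replicate of the same size as the data
bag_trees <- function(formula, data, n_models = 25) {
  lapply(seq_len(n_models), function(i) {
    boot <- data[sample(nrow(data), replace = TRUE), ]
    rpart(formula, data = boot, method = "class")   # unstable base learner
  })
}

# Combine the individual predictions by majority vote
predict_bagging <- function(models, newdata) {
  votes <- sapply(models, function(m) as.character(predict(m, newdata, type = "class")))
  factor(apply(votes, 1, function(v) names(which.max(table(v)))))
}

# For example, with the KC1 training/test split created in the boosting example below:
# ensemble <- bag_trees(Defective ~ ., kc1.train)
# table(predict_bagging(ensemble, kc1.test), kc1.test$Defective)
```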
### Boosting
Boosting techniques generate multiple models that complement each other inducing models that improve regions of the data where previous induced models preformed poorly. This is achieved by increasing the weights of instances wrongly classified, so new learners focus on those instances. Finally, classification is based on a weighted voted among all members of the ensemble.
In particular, AdaBoost.M1 [15] is a popular boosting algorithm for classification. The set of training examples is assigned an equal weight at the beginning and the weight of instances is either increased or
decreased depending on whether the learner classified that instance incorrectly or not. The following iterations focus on those instances with higher weights. AdaBoost.M1 can be applied to any base learner.
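The sketch uses rpart decision stumps as the weak learner; `adaboost_m1_sketch` and `predict_adaboost` are our names rather than a package API, and the chunk is not evaluated. Packages such as ada and adabag provide complete implementations.
```{r adaboostSketch, eval = FALSE}
library(rpart)

adaboost_m1_sketch <- function(data, response, n_rounds = 10) {
  fml <- as.formula(paste(response, "~ ."))  # built here so rpart finds the local weights
  y <- data[[response]]
  n <- nrow(data)
  w <- rep(1 / n, n)                         # start with equal instance weights
  models <- list(); alphas <- numeric(0)
  for (m in seq_len(n_rounds)) {
    fit <- rpart(fml, data = data, weights = w, method = "class",
                 control = rpart.control(maxdepth = 1))     # decision stump
    miss <- as.numeric(predict(fit, data, type = "class") != y)
    err <- sum(w * miss)                     # weighted training error (weights sum to 1)
    if (err == 0 || err >= 0.5) break        # AdaBoost.M1 stopping conditions
    alpha <- log((1 - err) / err)            # vote weight of this model
    w <- w * exp(alpha * miss)               # increase the weights of misclassified instances
    w <- w / sum(w)
    models[[m]] <- fit; alphas[m] <- alpha
  }
  list(models = models, alphas = alphas, classes = levels(y))
}

# Final classification: alpha-weighted vote among the ensemble members
predict_adaboost <- function(ens, newdata) {
  scores <- sapply(ens$classes, function(cl)
    rowSums(mapply(function(m, a) a * (predict(m, newdata, type = "class") == cl),
                   ens$models, ens$alphas)))
  factor(ens$classes[max.col(scores)], levels = ens$classes)
}

# For example: boosted <- adaboost_m1_sketch(kc1.train, "Defective", n_rounds = 20)
#              table(predict_adaboost(boosted, kc1.test), kc1.test$Defective)
```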
### Rotation Forests
Rotation Forests [40] combine randomly chosen subsets of attributes (random subspaces) and bagging with principal-component feature generation to construct an ensemble of decision trees. Principal Component Analysis is applied, as a feature extraction technique, to subsets of attributes of a bootstrapped sample of the training data, and the base classifier is trained on the rotated features. A rough sketch of the idea follows.
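The sketch is illustrative only (our function names, not a package API) and is not evaluated; it assumes numeric predictors and more bootstrap rows than features per feature subset.
```{r rotationForestSketch, eval = FALSE}
library(rpart)

# One rotated tree: split the features into random subsets, run PCA on a bootstrap
# sample of each subset, assemble a block-diagonal rotation matrix R and train a
# decision tree on the rotated data X %*% R.
rotation_tree <- function(X, y, n_subsets = 3) {
  X <- as.matrix(X)
  p <- ncol(X)
  subsets <- split(sample(seq_len(p)), rep(seq_len(n_subsets), length.out = p))
  R <- matrix(0, p, p)
  boot <- sample(nrow(X), replace = TRUE)
  for (s in subsets) {
    pca <- prcomp(X[boot, s, drop = FALSE])   # principal axes of this feature subset
    R[s, s] <- pca$rotation
  }
  d <- data.frame(X %*% R, y = y)
  list(tree = rpart(y ~ ., data = d, method = "class"), R = R)
}

rotation_forest <- function(X, y, n_trees = 10) {
  lapply(seq_len(n_trees), function(i) rotation_tree(X, y))
}

# Majority vote over the rotated trees
predict_rotation_forest <- function(forest, Xnew) {
  Xnew <- as.matrix(Xnew)
  votes <- sapply(forest, function(m)
    as.character(predict(m$tree, data.frame(Xnew %*% m$R), type = "class")))
  apply(votes, 1, function(v) names(which.max(table(v))))
}

# For example, with the numeric metrics of KC1 (all columns except Defective):
# rf <- rotation_forest(kc1.train[, -22], kc1.train$Defective)
# table(predict_rotation_forest(rf, kc1.test[, -22]), kc1.test$Defective)
```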
### Boosting in R
In R, there are several packages for boosting, e.g. the gbm, ada and mboost packages. Below is an example of gbm used through the caret package.
```{r}
# load libraries
library(caret)
library(pROC)
#################################################
# model it
#################################################
# Get names of caret supported models (just a few - head)
head(names(getModelInfo()))
# Show model info and find out what type of model it is
getModelInfo()$gbm$tags
getModelInfo()$gbm$type
```
Example: fitting a gbm model to the KC1 defect prediction dataset.
```{r}
library(foreign)
library(caret)
library(pROC)
kc1 <- read.arff("./datasets/defectPred/D1/KC1.arff")
# Split data into training and test datasets
# TODO: improve this with createDataPartition from caret
set.seed(1234)
ind <- sample(2, nrow(kc1), replace = TRUE, prob = c(0.7, 0.3))
kc1.train <- kc1[ind==1, ]
kc1.test <- kc1[ind==2, ]
# create caret trainControl object to control the number of cross-validations performed
objControl <- trainControl(method='cv', number=3, returnResamp='none', summaryFunction = twoClassSummary, classProbs = TRUE)
# run model
objModel <- train(Defective ~ .,
                  data = kc1.train,
                  method = 'gbm',
                  trControl = objControl,
                  metric = "ROC" #,
                  #preProc = c("center", "scale")
                  )
# Find out variable importance
summary(objModel)
# find out model details
objModel
```
Evaluate model
```{r}
#################################################
# evaluate model
#################################################
# get predictions on your testing data
# class prediction
predictions <- predict(object=objModel, kc1.test[,-22], type='raw')
head(predictions)
postResample(pred=predictions, obs=as.factor(kc1.test[,22]))
# probabilities
predictions <- predict(object=objModel, kc1.test[,-22], type='prob')
head(predictions)
postResample(pred=predictions[[2]], obs=ifelse(kc1.test[,22]=='Y',1,0))  # use the positive class label 'Y' consistently
auc <- roc(ifelse(kc1.test[,22]=="Y",1,0), predictions[[2]])
print(auc$auc)
```