Simple machine learning task example
Here I'm going to give step-by-step instructions on how to do a simple machine learning task using the dinrhiw tools dstool and nntool.
First, compile and install the library and tools:
autoconf; ./configure; make; make makelib; su; make install; exit;
and
cd tools; make all; su; make install; exit;
After this, let's get some data from the UCI machine learning repository http://archive.ics.uci.edu/ml/.
Specifically, we are going to use machine learning to predict the quality of different wines based on measurements: https://archive.ics.uci.edu/ml/datasets/Wine+Quality.
wget http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv
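Before writing any preprocessing code, it is worth peeking at the raw file; the head command below simply prints its first two lines:
head -n 2 winequality-white.csv
The first line is a header with the names of the 11 input measurements plus the quality column, and the remaining lines are semicolon-separated data rows.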
The data looks good but requires some preprocessing to transform it into the whitespace-separated format the tools read. For preprocessing we are going to use Perl and write a short script that 1) splits the data into input and output files, 2) removes the initial header line of labels, and 3) replaces the semicolon separators between numbers with spaces.
#!/usr/bin/perl
use strict;
use warnings;

open(my $csv, '<', 'winequality-white.csv') or die "cannot open input: $!";
open(my $out, '>', 'winequality.out') or die "cannot open output: $!";
open(my $in,  '>', 'winequality.in')  or die "cannot open output: $!";

my $firstLine = <$csv>;  # skip the header line of column labels

while (<$csv>) {
    chomp;
    my @words = split(";", $_);
    # all columns except the last one are input variables
    for my $i (0 .. $#words - 1) {
        print $in "$words[$i] ";
    }
    print $in "\n";
    # the last column (wine quality) is the output variable
    print $out "$words[$#words]\n";
}

close($csv);
close($out);
close($in);
OK. That was simple, though not totally straightforward, to write, but once you have written a few of these conversion scripts they become simple variations of each other.
Now we have two files, winequality.in and winequality.out. The first one contains the input variables of our problem and the other one the corresponding output values (wine quality).
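As a quick sanity check, assuming the script above was saved as convert.pl (the file name is arbitrary), both files should contain one line per wine sample:
perl convert.pl
wc -l winequality.in winequality.out
Both files should report the same number of lines (4898 data rows for this dataset).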
The next thing we are going to do is use dstool to import the data into a binary format that nntool can use, and to do some simple preprocessing so that the data is in roughly the correct range ~[-1,1] for the neural network.
Initially, we create a new dataset file using the create command and then create the clusters with further create commands, specifying the number of variables in each cluster. After that we are ready to import our data files (winequality.in and winequality.out) using the import command for both clusters (0 and 1). Once our data is inside the file, we add preprocessing steps (removal of mean and variance using meanvar, and removal of correlations using pca), as sketched mathematically after the commands below.
dstool -create wine-test.ds
dstool -create:11:input wine-test.ds
dstool -create:1:output wine-test.ds
dstool -list wine-test.ds
dstool -import:0 wine-test.ds winequality.in
dstool -import:1 wine-test.ds winequality.out
dstool -padd:0:meanvar wine-test.ds
dstool -padd:0:pca wine-test.ds
dstool -padd:1:meanvar wine-test.ds
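For intuition, the two preprocessing steps correspond roughly to the following standard transforms (a sketch of mean/variance normalization and PCA decorrelation; dinrhiw's exact implementation details may differ):

x_i' = \frac{x_i - \mu_i}{\sigma_i} \qquad \text{(meanvar: each component gets zero mean and unit variance)}

\Sigma = \mathrm{E}[x' x'^{\top}] = V \Lambda V^{\top}, \qquad x'' = \Lambda^{-1/2} V^{\top} x' \qquad \text{(pca: decorrelation using the eigendecomposition of the covariance matrix)}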
After we have created our dataset file for machine learning, it is now time to do the actual job using the nntool command.
nntool accepts many different parameters, but basically we just need to tell nntool which optimization method to use (we are going to use parallelbfgs) and define how many iterations or seconds the optimization process should run. The command:
nntool -v --time 200 wine-test.ds ?-20-? winenn.cfg parallelbfgs
tells nntool to:
- be verbose (-v)
- use 200 seconds for optimization (--time 200)
- read data from the wine-test.ds file
- use the special ?-20-? notation for the neural network architecture, where ? marks an input/output dimension that is autodetected by nntool from the dataset file (so the network architecture will be 11-20-1)
- save the model parameters to (or load them from) the configuration file winenn.cfg
- use the optimization method parallelbfgs (a parallel multistart BFGS method, which is the best for small networks)
- use cross-validation (early stopping), because the --overfit parameter is not given: the data is divided into training and testing sets, and optimization on the training set is only continued as long as it improves performance on the testing set
After running for 200 seconds, the program stores the results in the winenn.cfg file, which is then used immediately to calculate the average error on the dataset (the numbers displayed during the optimization process are approximations):
nntool -v wine-test.ds ?-20-? winenn.cfg use
The command reports the average error of the prediction model to be 0.0188697, which is rather good for the given dataset. (For example, the basic neural network optimizer in [http://sourceforge.net/projects/rapidminer/ Rapid Miner] gives worse results.)
How should we interpret the error values? When the data is properly normalized, results larger than 0.01 mean that the mathematical model cannot fully fit the data, meaning the errors are unacceptably high, while values much below 0.01 mean that the program manages to learn the data well. The result 0.0188697 therefore means that with proper tuning it might be possible to get good enough results (try increasing the optimization time and the size of the network slightly, as in the sketch below).
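For example, a follow-up run with a longer optimization time and a slightly larger hidden layer could look like the following (600 seconds, 40 hidden neurons and the file name winenn2.cfg are just illustrative values, not tested recommendations):
nntool -v --time 600 wine-test.ds ?-40-? winenn2.cfg parallelbfgs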
Finally, after learning the model, we re-predict the datapoints and compare them to the real outputs. These steps can be done using the following commands:
# predicting [stores the results to the dataset]
cp -f wine-test.ds wine-pred.ds
# clear the output cluster (1) so that nntool fills it with predictions
dstool -clear:1 wine-pred.ds
# ./dstool -remove:1 wine-pred.ds
nntool -v wine-pred.ds ?-20-? winenn.cfg use
dstool -list wine-test.ds
dstool -list wine-pred.ds
# print the last predicted values and compare them to the true outputs
dstool -print:1:4886:4897 wine-pred.ds
tail winequality.out
Alternatively, one could also use dstool's export command to store the predicted datapoints in an ASCII file instead of just printing a few of them.
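As a sketch, such an export might look like the command below. Note that the exact flag syntax here is an assumption patterned after the -import:<cluster> commands used earlier, so check dstool's help output for the real form.
# assumed syntax: export cluster 1 (the predictions) into an ASCII file
dstool -export:1 wine-pred.ds winequality-predicted.out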