Unix Tools Documentation

dstool

Dstool is a helper program for nntool, and it will probably be merged into nntool at some point. It reads comma-separated (CSV) ASCII data and stores it in a format nntool can understand.

DATAFILE <=> DSTOOL <=> NNTOOL

Command-line arguments of dstool are:

Usage: datatool <command> <datafile> [asciifile | datafile]
A tool for manipulating whiteice::dataset files.

 -list                      lists clusters, number of datapoints, preprocessings.
                            (default action)
 -print[:<c1>[:<b>[:<e>]]]  prints contents of cluster c1 (indexes [<b>,<e>])
 -create                    creates new empty dataset (<dataset> file doesn't exist)
 -create:<dim>[:name]       creates new empty <dim> dimensional dataset cluster
 -import:<c1>               imports data from comma separated CSV ascii file to cluster c1
 -export:<c1>               exports data from cluster c1 to comma separated CSV ascii file
 -add:<c1>:<c2>             adds data from another datafile cluster c2 to c1
 -move:<c1>:<c2>            moves data (internally) from cluster c2 to c1
 -copy:<c1>:<c2>            copies data (internally) from cluster c2 to c1
 -clear:<c1>                clears dataset cluster c1 (but doesn't remove cluster)
 -remove:<c1>               removes dataset cluster c1
 -padd:<c1>:<name>+         adds preprocessing(s) to cluster
 -premove:<c1>:<name>+      removes preprocessing(s) from cluster
                            preprocess names: meanvar, outlier, pca, ica
                            note: ica implementation is unstable and may not work
 -data:N                    jointly resamples all cluster sizes down to N datapoints

This program is distributed under LGPL license <dinrhiw2.googlecode.com>.
  1. The list command lists dataset information: the number of datapoints, clusters and their preprocessing methods. It exists just for informational and testing purposes.
  2. Similarly, the print command prints the given cluster's datapoints for debugging purposes.
  3. The create command creates a new empty dataset file or, given a dimension, a new D-dimensional cluster in the dataset. Importing data always starts with this command.
  4. The import and export commands move data between comma-separated CSV files and the dataset; they are the most important commands when preparing a datafile for processing with nntool (see the example workflow after this list).
  5. The add, move, copy, clear and remove commands are seldom needed; they move or copy datapoints between clusters.
  6. The padd command is another essential command: it adds preprocessing to data before it is fed to the neural network. meanvar is probably the most important preprocessing here, as it removes the mean and normalizes the variance of the (input) data to unity, which works well with nntool's neural network implementation.
  7. The premove command removes preprocessings from a cluster.
  8. The data command JOINTLY downsamples all clusters to N datapoints. This requires that every cluster contains the same number of datapoints (the usual case).
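As a concrete sketch, a typical import workflow might look like the following (the file and cluster names are illustrative, and cluster indices are assumed to be zero-based; the flags are those listed above):

  dstool -create train.ds                 # create a new, empty dataset file
  dstool -create:3:input train.ds         # add a 3-dimensional cluster named "input" (cluster 0)
  dstool -create:1:output train.ds        # add a 1-dimensional cluster named "output" (cluster 1)
  dstool -import:0 train.ds inputs.csv    # import CSV rows into cluster 0
  dstool -import:1 train.ds outputs.csv   # import CSV rows into cluster 1
  dstool -padd:0:meanvar train.ds         # normalize inputs to zero mean and unit variance
  dstool -list train.ds                   # verify clusters, datapoint counts and preprocessings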

nntool

NNtool is the primary program in dinrhiw2. It is used to create, train and use neural networks. It reads dataset files created by dstool and stores results back into dataset files or text-formatted neural network configuration files. It does not visualize results; it only prints machine learning statistics and an estimate of the time needed to complete the (time-consuming) commands. Many of its algorithms take advantage of multicore CPUs through multithreading and use optimized BLAS libraries to carry out computations.

Arguments of the program are:

Usage: nntool [options] [data] [arch] <nnfile> [lmethod]
Create, train and use neural network(s).

-v             shows ETA and other details
--help         shows this help
--version      displays version and exits
--no-init      do not use heuristics when initializing nn weights
--overfit      do not use early stopping (bfgs,lbfgs)
--adaptive     use adaptive step length in bayesian hamiltonian monte carlo (bayes)
--negfb        use negative feedback between neurons (grad,parallelgrad,bfgs,lbfgs)
--load         use previously computed network weights as the starting point (grad,bfgs,lbfgs,bayes)
--time TIME    sets time limit for multistart optimization and bayesian inference
--samples N    samples N samples or defines max iterations (eg. 2500) to be used in optimization/sampling
--threads N    uses N parallel threads when looking for solution
--data N       takes randomly N samples of data for the learning process (N/2 used in training)
[data]         a source file for inputs or i/o examples (binary file)
               (whiteice data file format created by dstool)
[arch]         the architecture of a new nn. Eg. 3-10-9 or ?-10-?
<nnfile>       input/output neural networks weights file
[lmethod]      method: use, random, grad, parallelgrad, bayes, lbfgs, parallelbfgs, parallellbfgs
               parallel methods use random location multistart/restart parallel search
               until timeout or the number of samples has been reached
               additionally: minimize method finds input that minimizes the neural network output
               gradient descent algorithms use negative feedback heuristic

               Ctrl-C shutdowns the program gracefully.

Report bugs to <dinrhiw2.googlecode.com>.
  1. The -v option tells nntool to be more verbose, printing many informational messages about what is happening.
  2. The --help option prints the full list of options.
  3. The --no-init option tells nntool not to use the default weight-initialization heuristics (still used by parallelgrad, bfgs and lbfgs) but to start from a random weight initialization instead.
  4. The --overfit option tells nntool to continue optimization even when cross-validation (the testing dataset) indicates it should stop early (early stopping is the default).
  5. The --adaptive option makes the bayes/HMC sampler use adaptive step-length heuristics, which should improve the sampler's convergence to the target distribution (the posterior distribution of the neural network weights).
  6. The --negfb option activates the "negative feedback between neurons" heuristic during training, which can be very useful for forcing different neurons to behave independently (otherwise different neuronal units can become correlated) and for using the neural network's capacity efficiently.
  7. The --load option loads the network from disk and continues training from it instead of starting from scratch. This is useful when you want to apply different training algorithms in sequence (see the sketch after this list).
  8. The --time and --samples options specify the time used or the number of samples (or iterations) collected during the learning process. These are very useful when training can take several hours.
  9. The --threads option specifies the number of threads used by nntool's parallel algorithms (by default nntool autodetects the number of processor cores and uses that many threads).
  10. The --data option works as in dstool: when reading from the dataset, the number of datapoints is downsampled to N before they are used.
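For example, a hypothetical two-stage run (the file names and the 3-10-1 architecture are illustrative assumptions) could first perform a time-limited multistart search and then refine the best weights with --load:

  nntool -v --threads 4 --time 600 train.ds 3-10-1 nn.cfg parallellbfgs
  nntool -v --load --time 600 train.ds 3-10-1 nn.cfg grad

The unit of the TIME argument is not stated in the help text above; seconds is an assumption here.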

The other arguments (non-options) are:

  1. Data: the dataset file to read and use (created by dstool).
  2. Arch: the architecture of the neural network that will be trained or used, given as the number of units per layer. For example, 10-5-2 trains a neural network with 10 inputs, 5 units in a hidden layer and 2 units in the output layer. A ? wildcard form such as ?-10-? is also accepted (see the sketch after this list).
  3. Nnfile: the neural network configuration file where the results of training are stored (or read from).
  4. Lmethod is perhaps the most important parameter of nntool. It specifies the learning method: random uses random search, grad uses gradient descent, bayes uses HMC sampling, lbfgs uses L-BFGS optimization, parallelgrad uses multistart parallel multithreaded gradient descent, parallelbfgs uses multistart parallel multithreaded BFGS optimization, and parallellbfgs uses multistart parallel multithreaded L-BFGS optimization.
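Since the architecture string accepts ? for the input and output layers (e.g. ?-10-?), those dimensions can presumably be inferred from the dataset. A minimal sketch with illustrative file names, fixing only the hidden layer width:

  nntool -v train.ds ?-20-? nn.cfg lbfgs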

The parallel multithreaded optimization methods all operate with the same logic: N threads are started, each running an independent optimization process. Each thread checks for convergence, stores its result as the global best if it improves on the current best, and then restarts its search from a new starting point until the timeout or the requested number of iterations (samples) has been reached.

Additionally, there is the use method, which operates in two modes. If the dataset has two clusters (as in training mode), the neural network model is used to predict the outputs and the average squared prediction error over the data is computed. Alternatively, if the dataset has a single (input) cluster and another, empty cluster, the predicted results are stored into the empty cluster, from which they can be printed or exported to a CSV-formatted ASCII file using dstool (see the sketch below).
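A prediction workflow along these lines might look like the following sketch (file names and cluster dimensions are illustrative, and it is assumed that the architecture argument can be omitted when an already-trained network is loaded from <nnfile>):

  dstool -create pred.ds
  dstool -create:3:input pred.ds            # dimension must match the trained network's input
  dstool -create:1:output pred.ds           # left empty; nntool fills it with predictions
  dstool -import:0 pred.ds new_inputs.csv
  nntool pred.ds nn.cfg use                 # stores predictions into the empty cluster
  dstool -export:1 pred.ds predictions.csv  # export the predictions back to CSV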