split postprocessor.tex into separate files #1817

Merged 2 commits on Apr 25, 2022
doc/user_manual/PostProcessors/BasicStatistics.tex (270 additions; large diff not rendered)

doc/user_manual/PostProcessors/ComparisonStatistics.tex (124 additions)
\subsubsection{ComparisonStatistics}
\label{ComparisonStatistics}
The \textbf{ComparisonStatistics} post-processor computes statistics
for comparing two different dataObjects. This is an experimental
post-processor, and it is expected to change as it is further
developed.

Four nodes are used in the post-processor.

\begin{itemize}
\item \xmlNode{kind}: specifies how the provided data are binned for
  the comparison. The node text is either \xmlString{uniformBins},
  which makes the bin widths uniform, or \xmlString{equalProbability},
  which makes the number of counts in each bin equal. It can take the
  following attributes:
  \begin{itemize}
    \item \xmlAttr{numBins}, which takes a number that directly
      specifies the number of bins
    \item \xmlAttr{binMethod}, which takes a string that specifies the
      method used to calculate the number of bins. This can be either
      \xmlString{square-root} or \xmlString{sturges}.
  \end{itemize}
\item \xmlNode{compare}: specifies the data to use for the comparison.
  This can be either a reference distribution or a dataObject:
  \begin{itemize}
    \item \xmlNode{data}: specifies the data to be used. The
      different parts of the location are separated by $|$'s.
    \item \xmlNode{reference}: specifies a reference distribution to
      be used. The distribution must be defined in the
      \xmlNode{Distributions} block, and the \xmlAttr{name} attribute
      selects which distribution is used.
  \end{itemize}
\item \xmlNode{fz}: if the node text is \xmlString{true}, extra
  comparison statistics based on the $f_Z$ function are generated.
  These take extra time, so they are off by default.
\item \xmlNode{interpolation}: switches the interpolation used for
  the cdf and pdf functions between the default \xmlString{quadratic}
  and \xmlString{linear}.
\end{itemize}

The \textbf{ComparisonStatistics} post-processor generates a variety
of data. First, for each set of data provided, it calculates bin
boundaries and counts the number of data points in each bin. From the
counts in each bin, it numerically creates a cdf function, and from
the cdf it takes the derivative to generate a pdf. It also calculates
statistics of the data such as the mean and standard deviation. The
post-processor can only generate a CSV file as output.

The post-processor uses the generated pdf and cdf functions to
calculate various statistics. The first is the cdf area difference:
\begin{equation}
cdf\_area\_difference = \int_{-\infty}^{\infty}{|CDF_a(x)-CDF_b(x)|dx}
\end{equation}
This gives an idea of how far apart the two pieces of data are, and
it has units of $x$.

The common area between the two pdfs is also calculated. If there is
perfect overlap, this will be 1.0; if there is no overlap, this will
be 0.0. The formula used is:
\begin{equation}
pdf\_common\_area = \int_{-\infty}^{\infty}{\min(PDF_a(x),PDF_b(x))}dx
\end{equation}
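The two integrals above can be approximated with a simple Riemann sum on a common grid. A minimal numerical sketch (illustrative only, not RAVEN's implementation) using two normal pdfs:

```python
import numpy as np

def normal_pdf(t, mu, sigma):
    # density of a normal distribution N(mu, sigma^2)
    return np.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

x = np.linspace(400.0, 430.0, 20001)
dx = x[1] - x[0]
pdf_a = normal_pdf(x, 410.0, 2.0)
pdf_b = normal_pdf(x, 412.0, 2.0)

# build cdfs by cumulative integration of the pdfs
cdf_a = np.cumsum(pdf_a) * dx
cdf_b = np.cumsum(pdf_b) * dx

# cdf area difference: integral of |CDF_a - CDF_b|, carries the units of x;
# for two equal-shape distributions it equals the shift between them (2.0 here)
cdf_area_difference = np.sum(np.abs(cdf_a - cdf_b)) * dx

# pdf common area: integral of min(PDF_a, PDF_b), between 0.0 and 1.0
pdf_common_area = np.sum(np.minimum(pdf_a, pdf_b)) * dx
```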

The difference pdf between the two pdfs is calculated as:
\begin{equation}
f_Z(z) = \int_{-\infty}^{\infty}f_X(x)f_Y(x-z)dx
\end{equation}
This produces a pdf that contains information about the difference
between the two pdfs. Its mean is calculated (only if \xmlNode{fz} is
true) as:
\begin{equation}
\bar{z} = \int_{-\infty}^{\infty}{z f_Z(z)dz}
\end{equation}
The mean can be used to get a signed difference between the pdfs,
which shows how their means compare.

The variance of the difference pdf can be calculated as (and will be
calculated only if fz is true):
\begin{equation}
var = \int_{-\infty}^{\infty}{(z-\bar{z})^2 f_Z(z)dz}
\end{equation}

The integral of the difference pdf is also calculated if \xmlNode{fz}
is true:
\begin{equation}
sum = \int_{-\infty}^{\infty}{f_Z(z)dz}
\end{equation}
This should be 1.0; if it differs, that points to approximation
error in the calculation.
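Under the same illustrative assumptions (normal pdfs invented for the example, not RAVEN code), the $f_Z$ convolution and its moments can be checked numerically; the integral of $f_Z$ coming out close to 1.0 is exactly the sanity check described above:

```python
import numpy as np

def normal_pdf(t, mu, sigma):
    # density of a normal distribution N(mu, sigma^2)
    return np.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

x = np.linspace(395.0, 430.0, 4001)   # integration grid for x
dx = x[1] - x[0]
z = np.linspace(-15.0, 15.0, 1201)    # grid for the difference variable z
dz = z[1] - z[0]

# f_Z(z) = integral of f_X(x) * f_Y(x - z) dx, evaluated point by point;
# with X ~ N(410, 2^2) and Y ~ N(412, 2^2), Z = X - Y ~ N(-2, 8)
f_x = normal_pdf(x, 410.0, 2.0)
f_z = np.array([np.sum(f_x * normal_pdf(x - zz, 412.0, 2.0)) * dx for zz in z])

total = np.sum(f_z) * dz                  # should be close to 1.0
zbar = np.sum(z * f_z) * dz               # signed mean difference (about -2)
var = np.sum((z - zbar) ** 2 * f_z) * dz  # variance of the difference (about 8)
```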


\textbf{Example:}
\begin{lstlisting}[style=XML]
<Simulation>
...
<Models>
...
<PostProcessor name="stat_stuff" subType="ComparisonStatistics">
<kind binMethod='sturges'>uniformBins</kind>
<compare>
<data>OriData|Output|tsin_TEMPERATURE</data>
<reference name='normal_410_2' />
</compare>
<compare>
<data>OriData|Output|tsin_TEMPERATURE</data>
<data>OriData|Output|tsout_TEMPERATURE</data>
</compare>
</PostProcessor>
<PostProcessor name="stat_stuff2" subType="ComparisonStatistics">
<kind numBins="6">equalProbability</kind>
<compare>
<data>OriData|Output|tsin_TEMPERATURE</data>
</compare>
<Distribution class='Distributions' type='Normal'>normal_410_2</Distribution>
</PostProcessor>
...
</Models>
...
<Distributions>
<Normal name='normal_410_2'>
<mean>410.0</mean>
<sigma>2.0</sigma>
</Normal>
</Distributions>
</Simulation>
\end{lstlisting}
doc/user_manual/PostProcessors/CrossValidation.tex (240 additions)
\subsubsection{CrossValidation}
\label{CVPP}
The \textbf{CrossValidation} post-processor is specifically used to evaluate estimator (i.e., ROM) performance.
Cross-validation is a statistical method of evaluating and comparing learning algorithms by dividing data into
two portions: one used to `train' a surrogate model and the other used to validate the model, based on specific
scoring metrics. In typical cross-validation, the training and validation sets cross over in successive
rounds such that each data point has a chance of being validated against the various sets. The basic form of
cross-validation is k-fold cross-validation; other forms are special cases of k-fold or involve
repeated rounds of k-fold cross-validation. \nb It is important to note that this post-processor currently
accepts only \textbf{PointSet} data objects.
%
\ppType{CrossValidation}{CrossValidation}
%
\begin{itemize}
\item \xmlNode{SciKitLearn}, \xmlDesc{string, required field}, whose subnodes specify the necessary information
  for the cross-validation algorithm to be used in the post-processor. `SciKitLearn' is based on the algorithms in the SciKit-Learn
  library, and it currently performs cross-validation over \textbf{PointSet} data only.
\item \xmlNode{Metric}, \xmlDesc{string, required field}, specifies the \textbf{Metric} name that is defined via
\textbf{Metrics} entity. In this xml-node, the following xml attributes need to be specified:
\begin{itemize}
\item \xmlAttr{class}, \xmlDesc{required string attribute}, the class of this metric (e.g. Metrics)
\item \xmlAttr{type}, \xmlDesc{required string attribute}, the sub-type of this Metric (e.g. SKL, Minkowski)
\end{itemize}
 \nb Currently, the cross-validation post-processor only accepts \xmlNode{SKL} metrics with \xmlNode{metricType}
\xmlString{mean\_absolute\_error}, \xmlString{explained\_variance\_score}, \xmlString{r2\_score},
\xmlString{mean\_squared\_error}, and \xmlString{median\_absolute\_error}.
\end{itemize}

\textbf{Example:}

\begin{lstlisting}[style=XML]
<Simulation>
...
<Files>
<Input name="output_cv" type="">output_cv.xml</Input>
<Input name="output_cv.csv" type="">output_cv.csv</Input>
</Files>
<Models>
...
<ROM name="surrogate" subType="SciKitLearn">
<SKLtype>linear_model|LinearRegression</SKLtype>
<Features>x1,x2</Features>
<Target>ans</Target>
<fit_intercept>True</fit_intercept>
<normalize>True</normalize>
</ROM>
<PostProcessor name="pp1" subType="CrossValidation">
<SciKitLearn>
<SKLtype>KFold</SKLtype>
<n_splits>3</n_splits>
<shuffle>False</shuffle>
</SciKitLearn>
<Metric class="Metrics" type="SKL">m1</Metric>
</PostProcessor>
...
</Models>
<Metrics>
<SKL name="m1">
<metricType>mean_absolute_error</metricType>
</SKL>
</Metrics>
<Steps>
<PostProcess name="PP1">
<Input class="DataObjects" type="PointSet">outputDataMC</Input>
<Input class="Models" type="ROM">surrogate</Input>
<Model class="Models" type="PostProcessor">pp1</Model>
<Output class="Files" type="">output_cv</Output>
<Output class="Files" type="">output_cv.csv</Output>
</PostProcess>
</Steps>
...
</Simulation>
\end{lstlisting}
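The XML above couples a 3-fold \xmlString{KFold} split with a \xmlString{mean\_absolute\_error} metric. A rough standalone sketch of the same computation in scikit-learn (the data and linear model here are invented for illustration; this is not RAVEN code):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(30, 2))                       # features x1, x2
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + 0.1 * rng.normal(size=30)  # target ans

scores = []
for train_idx, test_idx in KFold(n_splits=3, shuffle=False).split(X):
    model = LinearRegression(fit_intercept=True).fit(X[train_idx], y[train_idx])
    scores.append(mean_absolute_error(y[test_idx], model.predict(X[test_idx])))

# analogous to the cv_m1_ans value the post-processor reports
cv_m1_ans = float(np.mean(scores))
```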

In order to access the results from this post-processor, RAVEN defines variables named ``cv'' +
``\_'' + ``MetricName'' + ``\_'' + ``ROMTargetVariable'' to store the calculation results; these
variables are also accessible to the user through the RAVEN entities \textbf{DataObjects} and \textbf{OutStreams}.
In the previous example, the variable \textit{cv\_m1\_ans} is accessible to the user.

\paragraph{SciKitLearn}

The algorithm for cross-validation is chosen by the subnode \xmlNode{SKLtype} under the parent node \xmlNode{SciKitLearn}.
In addition, a special subnode \xmlNode{average} can be used to obtain the average cross validation results.

\begin{itemize}
\item \xmlNode{SKLtype}, \xmlDesc{string, required field}, contains a string that
represents the cross-validation algorithm to be used. As mentioned, its format is:

\xmlNode{SKLtype}algorithm\xmlNode{/SKLtype}.
 \item \xmlNode{average}, \xmlDesc{boolean, optional field}, if \xmlString{True}, the average cross-validation results are dumped into the
   output files.
\end{itemize}


Based on the \xmlNode{SKLtype}, several different algorithms are available. In the following paragraphs, a brief
explanation and the input requirements are reported for each of them.

\paragraph{K-fold}
\textbf{KFold} divides all the samples into $k$ groups of samples, called folds (if $k=n$, this is equivalent to the
\textbf{Leave One Out} strategy), of equal sizes (if possible). The prediction function is learned using $k-1$ folds,
and the fold left out is used for testing.
In order to use this algorithm, the user needs to set the subnode:
\xmlNode{SKLtype}KFold\xmlNode{/SKLtype}.
In addition to this XML node, several others are available:
\begin{itemize}
\item \xmlNode{n\_splits}, \xmlDesc{integer, optional field}, number of folds, must be at least 2. \default{3}
\item \xmlNode{shuffle}, \xmlDesc{boolean, optional field}, whether to shuffle the data before splitting into
batches.
\item \xmlNode{random\_state}, \xmlDesc{integer, optional field}, when shuffle=True,
pseudo-random number generator state used for shuffling. If not present, use default numpy RNG for shuffling.
\end{itemize}
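A minimal sketch of the fold structure this produces, using scikit-learn directly (indices invented for illustration):

```python
from sklearn.model_selection import KFold

X = list(range(6))  # six samples
splits = list(KFold(n_splits=3, shuffle=False).split(X))
# three folds of two test samples each; every sample appears in exactly one test set
for train, test in splits:
    print(train, test)
```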

\paragraph{Stratified k-fold}
\textbf{StratifiedKFold} is a variation of \textit{k-fold} which returns stratified folds: each set contains approximately
the same percentage of samples of each target class as the complete set.
In order to use this algorithm, the user needs to set the subnode:

\xmlNode{SKLtype}StratifiedKFold\xmlNode{/SKLtype}.

In addition to this XML node, several others are available:
\begin{itemize}
\item \xmlNode{labels}, \xmlDesc{list of integers, (n\_samples), required field}, contains a label for each sample.
\item \xmlNode{n\_splits}, \xmlDesc{integer, optional field}, number of folds, must be at least 2. \default{3}
\item \xmlNode{shuffle}, \xmlDesc{boolean, optional field}, whether to shuffle the data before splitting into
batches.
\item \xmlNode{random\_state}, \xmlDesc{integer, optional field}, when shuffle=True,
pseudo-random number generator state used for shuffling. If not present, use default numpy RNG for shuffling.
\end{itemize}
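A sketch of the stratified behavior, using scikit-learn directly (toy data invented for illustration):

```python
from sklearn.model_selection import StratifiedKFold

X = [[0.0]] * 8
y = [0, 0, 0, 0, 1, 1, 1, 1]   # two target classes, four samples each
splits = list(StratifiedKFold(n_splits=2).split(X, y))
# each test fold holds two samples of each class, preserving the class ratio
for train, test in splits:
    print(test, [y[i] for i in test])
```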

\paragraph{Label k-fold}
\textbf{LabelKFold} is a variation of \textit{k-fold} which ensures that the same label is not in both testing and
training sets. This is necessary for example if you obtained data from different subjects and you want to avoid
over-fitting (i.e., learning person specific features) by testing and training on different subjects.
In order to use this algorithm, the user needs to set the subnode:

\xmlNode{SKLtype}LabelKFold\xmlNode{/SKLtype}.

In addition to this XML node, several others are available:
\begin{itemize}
\item \xmlNode{labels}, \xmlDesc{list of integers with length (n\_samples, ), required field}, contains a label for
each sample. The folds are built so that the same label does not appear in two different folds.
\item \xmlNode{n\_splits}, \xmlDesc{integer, optional field}, number of folds, must be at least 2. \default{3}
\end{itemize}
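\textbf{LabelKFold} is the name used by the older scikit-learn releases this documentation targets; in current scikit-learn the same behavior is provided by \textbf{GroupKFold}. A sketch under that assumption, with toy data:

```python
from sklearn.model_selection import GroupKFold

X = [[i] for i in range(6)]
y = [0] * 6
groups = [0, 0, 1, 1, 2, 2]   # e.g. one label per subject
splits = list(GroupKFold(n_splits=3).split(X, y, groups))
# no label ever appears in both the training and the test set of a split
for train, test in splits:
    print(test)
```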

\paragraph{Leave-One-Out - LOO}
\textbf{LeaveOneOut} (or LOO) is a simple cross-validation. Each learning set is created by taking all the samples
except one, the test set being the sample left out. Thus, for $n$ samples, we have $n$ different training sets and
$n$ different test sets. This cross-validation procedure does not waste much data, as only one sample is removed from
the training set.
In order to use this algorithm, the user needs to set the subnode:

\xmlNode{SKLtype}LeaveOneOut\xmlNode{/SKLtype}.
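A sketch using scikit-learn directly (toy data invented for illustration):

```python
from sklearn.model_selection import LeaveOneOut

X = [[i] for i in range(4)]
splits = list(LeaveOneOut().split(X))
# n samples produce n splits, each with a single test sample
```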

\paragraph{Leave-P-Out - LPO}
\textbf{LeavePOut} is very similar to \textbf{LeaveOneOut} as it creates all the possible training/test sets by removing
$p$ samples from the complete set. For $n$ samples, this produces $\binom{n}{p}$ train-test pairs. Unlike \textbf{LeaveOneOut}
and \textbf{KFold}, the test sets will overlap for $p > 1$.
In order to use this algorithm, the user needs to set the subnode:

\xmlNode{SKLtype}LeavePOut\xmlNode{/SKLtype}.

In addition to this XML node, several others are available:
\begin{itemize}
\item \xmlNode{p}, \xmlDesc{integer, required field}, size of the test sets
\end{itemize}
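A sketch showing the combinatorial growth, using scikit-learn directly (toy data invented for illustration):

```python
from sklearn.model_selection import LeavePOut

X = [[i] for i in range(5)]
splits = list(LeavePOut(p=2).split(X))
# C(5, 2) = 10 train/test pairs; the test sets overlap since p > 1
```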

\paragraph{Leave-One-Label-Out - LOLO}
\textbf{LeaveOneLabelOut} (LOLO) is a cross-validation scheme which holds out the samples according to a third-party
provided array of integer labels. This label information can be used to encode arbitrary domain specific pre-defined
cross-validation folds. Each training set is thus constituted by all samples except the ones related to a specific
label.
In order to use this algorithm, the user needs to set the subnode:

\xmlNode{SKLtype}LeaveOneLabelOut\xmlNode{/SKLtype}.

In addition to this XML node, several others are available:
\begin{itemize}
\item \xmlNode{labels}, \xmlDesc{list of integers, (n\_samples,), required field}, arbitrary
domain-specific stratification of the data to be used to draw the splits.
\end{itemize}
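\textbf{LeaveOneLabelOut} is the older scikit-learn name; current releases call the equivalent splitter \textbf{LeaveOneGroupOut}. A sketch under that assumption, with toy data:

```python
from sklearn.model_selection import LeaveOneGroupOut

X = [[i] for i in range(6)]
y = [0] * 6
groups = [1, 1, 2, 2, 3, 3]   # third-party labels encoding predefined folds
splits = list(LeaveOneGroupOut().split(X, y, groups))
# one split per distinct label; each test set is exactly one label's samples
```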

\paragraph{Leave-P-Label-Out}
\textbf{LeavePLabelOut} is similar to \textit{Leave-One-Label-Out}, but removes the samples related to $P$ labels for
each training/test set.
In order to use this algorithm, the user needs to set the subnode:

\xmlNode{SKLtype}LeavePLabelOut\xmlNode{/SKLtype}.

In addition to this XML node, several others are available:
\begin{itemize}
\item \xmlNode{labels}, \xmlDesc{list of integers, (n\_samples,), required field}, arbitrary
domain-specific stratification of the data to be used to draw the splits.
\item \xmlNode{n\_groups}, \xmlDesc{integer, optional field}, number of samples to leave out in the test split.
\end{itemize}
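\textbf{LeavePLabelOut} maps to \textbf{LeavePGroupsOut} in current scikit-learn. A sketch under that assumption, with toy data:

```python
from sklearn.model_selection import LeavePGroupsOut

X = [[i] for i in range(6)]
y = [0] * 6
groups = [1, 1, 2, 2, 3, 3]
splits = list(LeavePGroupsOut(n_groups=2).split(X, y, groups))
# C(3, 2) = 3 splits, each holding out all the samples of two labels
```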

\paragraph{ShuffleSplit}
The \textbf{ShuffleSplit} iterator generates a user-defined number of independent train/test dataset splits. Samples
are first shuffled and then split into a pair of train and test sets. It is possible to control the randomness for
reproducibility of the results by explicitly seeding the \xmlNode{random\_state} pseudo-random number generator.
In order to use this algorithm, the user needs to set the subnode:

\xmlNode{SKLtype}ShuffleSplit\xmlNode{/SKLtype}.

In addition to this XML node, several others are available:
\begin{itemize}
\item \xmlNode{n\_splits}, \xmlDesc{integer, optional field}, number of re-shuffling and splitting iterations
\default{10}.
\item \xmlNode{test\_size}, \xmlDesc{float or integer, optional field}, if float, should be between 0.0 and 1.0 and
represent the proportion of the dataset to include in the test split. \default{0.1}
If integer, represents the absolute number of test samples. If not present, the value is automatically set to
the complement of the train size.
\item \xmlNode{train\_size}, \xmlDesc{float or integer, optional field}, if float, should be between 0.0 and 1.0 and represent
the proportion of the dataset to include in the train split. If integer, represents the absolute number of train
samples. If not present, the value is automatically set to the complement of the test size.
\item \xmlNode{random\_state}, \xmlDesc{integer, optional field}, seed for the
pseudo-random number generator used for shuffling. If not present, the default numpy RNG is used.
\end{itemize}
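A sketch of the reproducible shuffled splits, using scikit-learn directly (toy data invented for illustration):

```python
from sklearn.model_selection import ShuffleSplit

X = list(range(10))
ss = ShuffleSplit(n_splits=4, test_size=0.2, random_state=0)
splits = list(ss.split(X))
# four independent splits, each holding out 20% (2 of 10) samples for testing;
# the fixed random_state makes the splits reproducible across calls
```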

\paragraph{Label-Shuffle-Split}
\textbf{LabelShuffleSplit} iterator behaves as a combination of \textbf{ShuffleSplit} and \textbf{LeavePLabelOut},
and generates a sequence of randomized partitions in which a subset of labels are held out for each split.
In order to use this algorithm, the user needs to set the subnode:

\xmlNode{SKLtype}LabelShuffleSplit\xmlNode{/SKLtype}.

In addition to this XML node, several others are available:
\begin{itemize}
\item \xmlNode{labels}, \xmlDesc{list of integers, (n\_samples)}, labels of samples.
\item \xmlNode{n\_splits}, \xmlDesc{integer, optional field}, number of re-shuffling and splitting iterations
\default{10}.
\item \xmlNode{test\_size}, \xmlDesc{float or integer, optional field}, if float, should be between 0.0 and 1.0 and
represent the proportion of the dataset to include in the test split. \default{0.1}
If integer, represents the absolute number of test samples. If not present, the value is automatically set to
the complement of the train size.
\item \xmlNode{train\_size}, \xmlDesc{float or integer, optional field}, if float, should be between 0.0 and 1.0 and represent
the proportion of the dataset to include in the train split. If integer, represents the absolute number of train
samples. If not present, the value is automatically set to the complement of the test size.
\item \xmlNode{random\_state}, \xmlDesc{integer, optional field}, seed for the
pseudo-random number generator used for shuffling. If not present, the default numpy RNG is used.
\end{itemize}
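\textbf{LabelShuffleSplit} corresponds to \textbf{GroupShuffleSplit} in current scikit-learn. A sketch under that assumption, with toy data:

```python
from sklearn.model_selection import GroupShuffleSplit

X = [[i] for i in range(8)]
groups = [0, 0, 1, 1, 2, 2, 3, 3]   # labels of samples
gss = GroupShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
splits = list(gss.split(X, groups=groups))
# each split holds out whole labels; train and test labels never overlap
```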