In addition to the unit tests designed to verify correctness, Ginkgo also includes an extensive benchmark suite for checking its performance on all Ginkgo supported systems. The purpose of Ginkgo's benchmarking suite is to allow easy and complete reproduction of Ginkgo's performance, and to facilitate performance debugging as well. Most results published in Ginkgo papers are generated thanks to this benchmarking suite and are accessible online under the ginkgo-data repository. These results can also be used for performance comparison in order to ensure that you get similar performance as what is published on this repository.
To compile the benchmarks, the flag -DGINKGO_BUILD_BENCHMARKS=ON
has to be set
during the cmake
step. In addition, the ssget
command-line
utility has to be installed on the
system. The purpose of this file is to explain in detail the capacities of this
benchmarking suite as well as how to properly setup everything.
There are two ways to benchmark Ginkgo. When compiling the benchmark suite,
executables are generated for collecting matrix statistics, running
sparse-matrix vector product, solvers (possibly distributed) benchmarks. Another
way to run benchmarks is through the convenience script run_all_benchmarks.sh
,
but not all features are exposed through this tool!
Here is a short description of the content of this file:
- Ginkgo setup and best practice guidelines
- Installing and using the
ssget
tool to fetch the SuiteSparse matrices. - Running benchmarks manually
- Benchmarking with the script utility
- How to publish the benchmark results online and use the Ginkgo Performance Explorer (GPE) for performance analysis (optional).
- Using the benchmark suite for performance debugging thanks to the loggers.
- Available benchmark customization options with the script utility.
Before benchmarking Ginkgo, make sure that you follow the general guidelines in order to ensure best performance.
- The code should be compiled in
Release
mode. - Make sure the machine has no competing jobs. On a Linux machine multiple
commands can be used,
last
shows the currently opened sessions,top
orhtop
allows to show the current machine load, and if considering using specific GPUs,nvidia-smi
orrocm-smi
can be used to check their load. - By default, Ginkgo's benchmarks will always do at least one warm-up run. For better accuracy, every benchmark is also averaged over 10 runs, except for the solver benchmark which are usually fairly long. These parameters can be tuned at the command line to either shorten benchmarking time or improve benchmarking accuracy.
In addition, the following specific options can be considered:
- When specifically using the adaptive block jacobi preconditioner, enable
the
GINKGO_JACOBI_FULL_OPTIMIZATIONS
CMake flag. Be careful that this will use much more memory and time for the compilation due to compiler performance issues with register optimizations, in particular. - The current benchmarking setup also allows to benchmark only the overhead by
using as either (or for all) preconditioner/spmv/solver, the special
overhead
LinOp. If your purpose is to check Ginkgo's overhead, make sure to try this mode.
To benchmark ginkgo
, matrices need to be provided as input in the Matrix Market
format. A convenient way is to run benchmark with the SuiteSparse
matrix collection. A helper tool, the ssget
command-line utility can be used to
facilitate downloading and extracting matrices from the suitesparse collection.
When running the benchmarks with the helper script run_all_benchmarks.sh
(or
calling make benchmark
), the ssget
tool is required.
To install ssget
, access the repository and copy the file ssget
into a
directory present in your PATH
variable as per the tool's README.md
instructions. The tool can be installed either in a global system path or a
local directory such as $HOME/.local/bin
. After installing the tool, it is
important to review the ssget
script and configure as needed the variable
ARCHIVE_LOCATION
on line 39. This is where the matrices will be stored into.
The Ginkgo benchmark can be set to run on only a portion of the SuiteSparse matrix collection as we will see in the following section. Please note that the entire collection requires roughly 100GB of disk storage in its compressed format, and roughly 25GB of additional disk space for intermediate data (such us uncompressing the archive). Additionally, the benchmark runs usually take a long time (SpMV benchmarks on the complete collection take roughly 24h using the K20 GPU), and will stress the system.
Before proceeding, it can be useful in order to save time to download the
matrices as preparation. This can be done by using the ssget -f -i i
command
where i
is the ID of the matrix to be downloaded. The following loop allows
to download the full SuiteSparse matrix collection:
for i in $(seq 0 $(ssget -n)); do
ssget -f -i ${i}
done
Note that ssget
can also be used to query properties of the matrix and filter
the matrices which are downloaded. For example, the following will download only
positive definite matrices with less than 500M non zero elements and 10M
columns. Please refer to the ssget
documentation
for more information.
for i in $(seq 0 $(ssget -n)); do
posdef=$(ssget -p posdef -i ${i})
cols=$(ssget -p cols -i ${i})
nnz=$(ssget -p nonzeros -i ${i})
if [ "$posdef" -eq 1 -a "$cols" -lt 10000000 -a "$nnz" -lt 500000000 ]; then
ssget -f -i ${i}
fi
done
When compiling Ginkgo with the flag -DGINKGO_BUILD_BENCHMARKS=ON
, a suite of
executables will be generated depending on the CMake configuration. These
executables are the backbone of the benchmarking suite. Note that all of these
executables describe the available options and the required input format when
running them with the --help
option. All executables have multiple variants
depending on the precision, by default double
precision is used for the type
of values, but variants with single
and complex
(single and double) value
types are also available. Here is a non exhaustive list of the available
benchmarks:
blas/blas
: supports benchmarking many of Ginkgo's BLAS operations: dot products, axpy, copy, etc.conversion/conversion
: conversion between matrix formats.matrix_generator/matrix_generator
: mostly allows generating block diagonal matrices (to benchmark the block-jacobi preconditioner).matrix_statistics/matrix_statistics
: computes size and other matrix statistics (such as variance, load imbalance, ...).preconditioner/preconditioner
: benchmarks most Ginkgo preconditioner.solver/solver
: benchmark most of Ginkgo's solvers in a non distributed setting.sparse_blas/sparse_blas
: benchmarks Sparse BLAS operations, such as SpGEMM, SpGEAM, transpose.spmv/spmv
: benchmarks Ginkgo's matrix formats (Sparse-Matrix Vector product).
Optionally when compiling with MPI support:
blas/distributed/multi_vector
: measures BLAS performance on (distributed) multi-vectors.solver/distributed/solver
: distributed solver benchmarks.spmv/distributed/spmv
: distributed matrix Sparse-Matrix Vector (SpMV) product benchmark.
All benchmarks require input data as in a JSON
format. The json file has to
consist of exactly one array, and within that array the test cases are defined.
The exact syntax can change between executables, the --help
option will
explain the necessary JSON
input format. For example for the spmv
benchmark
case, and many other benchmarks the following minimal input should be provided:
[
{
"filename": "path/to/your/matrix",
"rhs": "path/to/your/rhs"
},
{ ... }
]
The files have to be in matrix market format.
Some benchmarks require some extra fields. For example the solver benchmarks
requires the field "optimal": {"spmv": "matrix format (such as csr)"}
. This is
automatically populated when running the spmv
benchmark which finds the
optimal (fastest) format among all requested formats.
After writing the necessary data in a JSON file, the benchmark can be called by passing in the input via stdin, i.e.
./solver < input.json
The output of our benchmarks is again JSON, and it is printed to stdout, while our status messages are printed to stderr. So, the output can be stored with
./solver < input.json > output.json
Note that in most cases, the JSON output by our benchmarks is compatible with
other benchmarks, therefore it is possible to first call the spmv
benchmark,
use the resulting output JSON as input to the solver
benchmark, and finally
use the resulting solver JSON output as input to the preconditioner
benchmark.
The benchmark suite is invoked using the make benchmark
command in the build
directory. Under the hood, this command simply calls the script
benchmark/run_all_benchmarks.sh
so it is possible to manually launch this
script as well. The behavior of the suite can be modified using environment
variables. Assuming the bash
shell is used, these can either be specified via
the export
command to persist between multiple runs:
export VARIABLE="value"
...
make benchmark
or specified on the fly, on the same line as the make benchmark
command:
VARIABLE="value" ... make benchmark
Since make
sets any variables passed to it as temporary environment variables,
the following shorthand can also be used:
make benchmark VARIABLE="value" ...
A combination of the above approaches is also possible (e.g. it may be useful to
export
the SYSTEM_NAME
variable, and specify the others at every benchmark
run).
The benchmark suite can take a number of configuration parameters. Benchmarks
can be run only for sparse matrix vector products (spmv)
, for full solvers
(with or without preconditioners), or for preconditioners only when supported.
The benchmark suite also allows to target a sub-part of the SuiteSparse matrix
collection. For details, see the [available benchmark options](### 6: Available
benchmark options). Here are the most important options:
-
BENCHMARK={spmv, solver, preconditioner}
- allows to select the type of benchmark to be ran. -
EXECUTOR={reference,cuda,hip,omp,dpcpp}
- select the executor and platform the benchmarks should be ran on. -
SYSTEM_NAME=<name>
- a name which will be used to designate this platform (e.g. V100, RadeonVII, ...). -
SEGMENTS=<N>
- Split the benchmarked matrix space into<N>
segments. If specified,SEGMENT_ID
also has to be set. -
SEGMENT_ID=<I>
- used in combination with theSEGMENTS
variable.<I>
should be an integer between 1 and<N>
, the number ofSEGMENTS
. If specified, only the<I>
-th segment of the benchmark suite will be run. -
BENCHMARK_PRECISION
- defines the precision the benchmarks are run in. Supported values are: "double" (default), "single", "dcomplex" and "scomplex" -
MATRIX_LIST_FILE=/path/to/matrix_list.file
- allows to list SuiteSparse matrix id or name to benchmark. As an example, a matrix list file containing the following will ensure that benchmarks are ran for only those three matrices:1903 Freescale/circuit5M thermal2
The previous experiments generated json files for each matrices, each containing
timing, iteration count, achieved precision, ... depending on the type of
benchmark run. These files are available in the directory
${ginkgo_build_dir}/benchmark/results/
. These files can be analyzed and
processed through any tool (e.g. python). In this section, we describe how to
generate the plots by using Ginkgo's
GPE tool. First, we need to publish the
experiments into a Github repository which will be then linked as source input
to the GPE. For this, we can simply fork the ginkgo-data repository. To do so,
we can go to the github repository and use the forking interface:
https://github.com/ginkgo-project/ginkgo-data/
Once it's done, we want to clone the repository locally, put all results online and access the GPE for plotting the results. Here are the detailed steps:
git clone https://github.com/<username>/ginkgo-data.git $HOME/ginkgo_benchmark/ginkgo-data
# send the benchmarked data to the ginkgo-data repository
# If needed, remove the old data so that no previous data is left.
# rm -r ${HOME}/ginkgo_benchmark/ginkgo-data/data/${SYSTEM_NAME}
rsync -rtv ${ginkgo_build_dir}/benchmark/results/ $HOME/ginkgo_benchmark/ginkgo-data/data/
cd ${HOME}/ginkgo_benchmark/ginkgo-data/data/
# The following updates the main `.json` files with the list of data.
# Ensure a python 3 installation is available.
./build-list . > list.json
./agregate < list.json > agregate.json
./represent . > represent.json
git config --local user.name "<Name>"
git config --local user.email "<email>"
git commit -am "Ginkgo benchmark ${BENCHMARK} of ${SYSTEM_NAME}..."
git push
Note that depending on what data is of interest, you may need to update the
scripts build-list
or agregate
to change which files you want to agglomerate
and summarize (depending on the system name), or which data you want to select
(solver results, spmv results, ...).
For the generating the plots in the GPE, here are the steps to go through:
- Access the GPE: https://ginkgo-project.github.io/gpe/
- Update data root URL, from
https://raw.githubusercontent.com/ginkgo-project/ginkgo-data/master/data
tohttps://raw.githubusercontent.com/<username>/ginkgo-data/<branch>/data
- Click on the arrow to load the data, select the
Result Summary
entry above. - Click on
select an example
to choose a plotting script. Multiple scripts are available by default in different branches. You can use thejsonata
andchartjs
languages to develop your own as well. - The results should be available in the tab "plot" on the right side. Other tabs allow to access the result of the processed data after invoking the processing script.
Detailed performance analysis can be ran by passing the environment variable
DETAILED=1
to the benchmarking script. This detailed run is available for
solvers and allows to log the internal residual after every iteration as well as
log the time taken by all operations. These features are also available in the
performance-debugging
example which can be used instead and modified as needed
to analyze Ginkgo's performance.
These features are implemented thanks to the loggers located in the file
${ginkgo_src_dir}/benchmark/utils/loggers.hpp
. Ginkgo possesses hooks at all
important code location points which can be inspected thanks to the logger. In
this fashion, it is easy to use these loggers also for tracking memory
allocation sizes and other important library aspects.
There are a set amount of options available for benchmarking. Most important
options can be configured through the benchmarking script itself thanks to
environment variables. Otherwise, some specific options are not available
through the benchmarking scripts but can be directly configured when running the
benchmarking program itself. For a list of all options, run for example
${ginkgo_build_dir}/benchmark/solver/solver --help
.
The supported environment variables are described in the following list:
BENCHMARK={spmv, solver, preconditioner}
- allows to select the type of benchmark to be ran. Default isspmv
.spmv
- Runs the sparse matrix-vector product benchmarks on the SuiteSparse collection.solver
- Runs the solver benchmarks on the SuiteSparse collection. The matrix format is determined by running thespmv
benchmarks first, and using the fastest format determined by that benchmark.preconditioner
- Runs the preconditioner benchmarks on artificially generated block-diagonal matrices.
EXECUTOR={reference,cuda,hip,omp,dpcpp}
- select the executor and platform the benchmarks should be ran on. Default iscuda
.SYSTEM_NAME=<name>
- a name which will be used to designate this platform (e.g. V100, RadeonVII, ...) and not overwrite previous results. Default isunknown
.SEGMENTS=<N>
- Split the benchmarked matrix space into<N>
segments. If specified,SEGMENT_ID
also has to be set. Default is1
.SEGMENT_ID=<I>
- used in combination with theSEGMENTS
variable.<I>
should be an integer between 1 and<N>
, the number ofSEGMENTS
. If specified, only the<I>
-th segment of the benchmark suite will be run. Default is1
.MATRIX_LIST_FILE=/path/to/matrix_list.file
- allows to list SuiteSparse matrix id or name to benchmark. As an example, a matrix list file containing the following will ensure that benchmarks are ran for only those three matrices:1903 Freescale/circuit5M thermal2
DEVICE_ID
- the accelerator device ID to target for the benchmark. The default is0
.DRY_RUN={true, false}
- If set totrue
, prepares the system for the benchmark runs (downloads the collections, creates the result structure, etc.) and outputs the list of commands that would normally be run, but does not run the benchmarks themselves. Default isfalse
.PRECONDS={jacobi,ic,ilu,paric,parict,parilu,parilut,ic-isai,ilu-isai,paric-isai,parict-isai,parilu-isai,parilut-isai,none}
the preconditioners to use for eithersolver
orpreconditioner
benchmarks. Multiple options can be passed to this variable. Default isnone
.FORMATS={csr,coo,ell,hybrid,sellp,hybridxx,cusparse_xx,hipsparse_xx}
the matrix formats to benchmark for thespmv
phase of the benchmark. Run${ginkgo_build_dir}/benchmark/spmv/spmv --help
for a full list. If needed, multiple options for hybrid with different optimization parameters are available. Depending on the libraries available at build time, vendor library formats (cuSPARSE withcusparse_
prefix or hipSPARSE withhipsparse_
prefix) can be used as well. Multiple options can be passed. The default iscsr,coo,ell,hybrid,sellp
.SOLVERS={bicgstab,bicg,cg,cgs,fcg,gmres,cb_gmres_{keep,reduce1,reduce2,integer,ireduce1,ireduce2},lower_trs,upper_trs}
- the solvers which should be benchmarked. Multiple options can be passed.
The default is
bicgstab,cg,cgs,fcg,gmres,idr
. Note thatlower/upper_trs
by default don't use a preconditioner, as they are by default exact direct solvers.
- the solvers which should be benchmarked. Multiple options can be passed.
The default is
SOLVERS_PRECISION=<precision>
- the minimal residual reduction before which the solver should stop. The default is1e-6
.SOLVERS_MAX_ITERATIONS=<number>
- the maximum number of iterations with which a solver should be ran. The default is10000
.SOLVERS_RHS={1, random, sinus}
- whether to use a vector of all ones, random values or b = A * (s / |s|)$ with s(idx) = sin(idx) (for complex numbers, s(idx) = sin(2idx) + i * sin(2idx+1)) as the right-hand side in solver benchmarks. Default is1
.SOLVERS_INITIAL_GUESS={rhs,0,random}
- the initial guess generation of the solvers.rhs
uses the right-hand side,0
uses a zero vector andrandom
generates a random vector as the initial guess.DETAILED={0,1}
- selects whether detailed benchmarks should be ran. This generally provides extra, verbose information at the cost of one or more extra benchmark runs. It can be either0
(off) or1
(on).GPU_TIMER={true, false}
- If set totrue
, use the gpu timer, which is valid for cuda/hip executor, to measure the timing. Default isfalse
.SOLVERS_JACOBI_MAX_BS
- sets the maximum block size for the Jacobi preconditioner (if used, otherwise, it does nothing) in the solvers benchmark. The default is '32'.SOLVERS_GMRES_RESTART
- the maximum dimension of the Krylov space to use in GMRES. The default is100
.