Performance evaluation of sparse linear algebra libraries for CPU and GPU targets.
This evaluation measures the execution time of the two most common sparse matrix operations, multiplication (x) and element-wise addition (+), over the Boolean semiring.
The purpose of the evaluation is to show the performance gain of Boolean-optimized operations over general-purpose ones. The comparison is not entirely fair, but Boolean-optimized counterparts of the general-purpose GPU libraries have not been introduced yet.
Note: some of the tested libraries do not provide extra optimizations for Boolean-valued operations. In these cases, the operations are imitated with the provided general-purpose primitives, where non-zero (floating-point) values are interpreted as true.
Library name | Target platform | Technology | Operation x | Operation + |
---|---|---|---|---|
cuBool | GPU | Nvidia Cuda | yes | yes |
clBool | GPU | OpenCL | yes | yes |
CUSP | GPU | Nvidia Cuda | yes | yes |
cuSPARSE | GPU | Nvidia Cuda | yes | yes |
clSPARSE | GPU | OpenCL | yes | no |
SuiteSparse | CPU | CPU | yes | yes |
- GCC Compiler
- CMake 3.15 (or greater)
- Cuda Toolkit
- OpenCL Dev-Package
- SuiteSparse
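Before configuring the project, it may help to verify that the main prerequisites are available (exact commands and package names depend on your distribution; this is only a quick sanity check):
$ gcc --version
$ cmake --version
$ nvcc --version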
Get the source code and go to the project directory.
$ git clone https://github.com/EgorOrachyov/SpBench
$ cd SpBench
$ git submodule update --init --recursive
The dataset for the benchmarks is available on Google Drive via the following access link.
The dataset is a collection of several square sparse matrices in the .mtx format.
These matrices represent various network graphs, which makes them a meaningful workload for benchmarking the x and + operations.
The dataset is stored as two .zip archives, dataset0.zip and dataset1.zip.
Download these archives and place them into the dataset folder in the repository root.
This folder must be created manually.
After that, the paths to the archives must be SpBench/dataset/dataset0.zip and SpBench/dataset/dataset1.zip respectively. These archives are automatically extracted and placed into the build directory during the build process.
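For reference, a minimal sketch of the expected layout, assuming the archives were downloaded into the repository root (adjust the source paths to wherever your browser saved them):
$ mkdir dataset
$ mv dataset0.zip dataset1.zip dataset/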
Create a build directory and go into it.
The example below demonstrates how to build the benchmarks in release mode.
$ mkdir build-release
$ cd build-release
Configure the build and run the actual compilation process.
$ cmake .. -DCMAKE_BUILD_TYPE=Release
$ cmake --build . --target all -j `nproc`
After the compilation process finishes, the benchmark executables are stored in the build directory. The naming follows the pattern {tool-name}_{operation-name}.
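As a quick sanity check, the produced binaries can be listed and matched against this pattern (the exact tool and operation names depend on which targets were configured; the filter below is only illustrative):
$ ls -1 | grep '_'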
For benchmarking the operation R = A + A^2 we need to generate an A^2 matrix for each matrix entry in the dataset. The A^2 matrix is generated by a cuBool-based application.
In order to run the generation, execute the following command inside the build directory:
$ ./cubool_prepare_data data/config_gen_m2.txt
This will generate the required A^2 matrices and store them inside the same data folders as the original A matrices, with the extension suffix .mtx2 instead of .mtx.
These matrices are automatically loaded inside the benchmarks by the MatrixLoader2 class, which appends 2 to the file name of the loaded target A matrix and loads the corresponding A^2 matrix.
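A quick way to verify the generation is to list the produced files (assuming the matrices are stored under the data folder of the build directory, as used by the configuration above):
$ find data -name '*.mtx2'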
In order to run the benchmarks for all tested targets, execute the following commands inside the build directory:
$ bash run_all.sh
$ bash summarize.sh
This will output a Summary.txt file with the benchmark stats.
In order to get the peak GPU memory usage, we first need to sample the GPU memory usage with a 1 ms time step. To do this, run the following command in a separate terminal:
$ nvidia-smi --query-gpu=timestamp,pstate,memory.total,memory.used --format=csv -lms 1 -f Profiling.csv
This runs the nvidia-smi built-in memory profiler, which samples the memory usage with a 1 ms time step and saves this information into the Profiling.csv table.
In the main terminal, execute the following command to run the actual library benchmarks (whose peak memory usage we want to obtain):
$ bash run_profiling.sh
When the last library finishes its benchmark, switch to the separate terminal and explicitly terminate the memory-capturing process with Ctrl + C.
After that, two files will be located in your build directory: Profiling.csv with the memory usage and Profiling-Time.txt with the time stamps (begin and end of each experiment). In order to extract the actual peak memory usage, run the Python script scripts/extract_mem_profile.py. To do that, copy this script into the build directory and run it.
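For example, from inside the build directory (a sketch: it assumes the script reads Profiling.csv and Profiling-Time.txt from the current working directory and that python3 is available):
$ cp ../scripts/extract_mem_profile.py .
$ python3 extract_mem_profile.py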
When the script finishes its job, it outputs a Profiling-Stats.txt file with the peak memory usage for each library benchmark and for each matrix entry in the dataset.
The matrix data is selected from the SuiteSparse Matrix Collection (formerly the University of Florida Sparse Matrix Collection); see the link.
№ | Matrix name | # Rows | Nnz M | Nnz/row | Max Nnz/row | Nnz M^2 | Nnz M + M^2 |
---|---|---|---|---|---|---|---|
0 | DIMACS10/wing | 62,032 | 243,088 | 3.9 | 4 | 714,200 | 917,178 |
1 | DIMACS10/luxembourg_osm | 114,599 | 239,332 | 2.0 | 6 | 393,261 | 632,185 |
2 | SNAP/amazon0312 | 400,727 | 3,200,440 | 7.9 | 10 | 14,390,544 | 14,968,909 |
3 | LAW/amazon-2008 | 735,323 | 5,158,388 | 7.0 | 10 | 25,366,745 | 26,402,678 |
4 | SNAP/web-Google | 916,428 | 5,105,039 | 5.5 | 456 | 29,710,164 | 30,811,855 |
5 | SNAP/roadNet-PA | 1,090,920 | 3,083,796 | 2.8 | 9 | 7,238,920 | 9,931,528 |
6 | SNAP/roadNet-TX | 1,393,383 | 3,843,320 | 2.7 | 12 | 8,903,897 | 12,264,987 |
7 | DIMACS10/belgium_osm | 1,441,295 | 3,099,940 | 2.1 | 10 | 5,323,073 | 8,408,599 |
8 | SNAP/roadNet-CA | 1,971,281 | 5,533,214 | 2.8 | 12 | 12,908,450 | 17,743,342 |
9 | DIMACS10/netherlands_osm | 2,216,688 | 4,882,476 | 2.2 | 7 | 8,755,758 | 13,626,132 |
For the performance evaluation, a PC running Ubuntu 20.04 was used, with an Intel Core i7-6700 3.40 GHz CPU, 64 GB DDR4 RAM, and a GeForce GTX 1070 GPU with 8 GB VRAM. Only the execution time of the operations themselves was measured. The data were assumed to be already loaded into VRAM or RAM, respectively, in the appropriate format required by the tested framework.
Matrix-matrix multiplication evaluation results.
Time in milliseconds, Memory in megabytes.
№ | cuBool | clBool | CUSP | cuSPARSE | clSPARSE | SuiteSparse |
---|---|---|---|---|---|---|
M | Time&Mem | Time&Mem | Time&Mem | Time&Mem | Time&Mem | Time&Mem |
0 | 1.9 93 | 1.9 89 | 5.2 125 | 20.1 155 | 4.9 105 | 7.9 22 |
1 | 2.4 91 | 2.0 89 | 3.7 111 | 1.7 151 | 6.9 97 | 3.1 169 |
2 | 23.2 165 | 55.5 163 | 108.5 897 | 412.8 301 | 52.2 459 | 257.6 283 |
3 | 33.3 225 | 82.1 221 | 172.0 1409 | 184.8 407 | 77.4 701 | 369.5 319 |
4 | 41.8 241 | 127.6 239 | 246.2 1717 | 4761.3 439 | 207.5 1085 | 673.3 318 |
5 | 18.1 157 | 14.2 153 | 42.1 481 | 37.5 247 | 56.6 283 | 66.6 294 |
6 | 22.6 167 | 16.9 165 | 53.1 581 | 46.7 271 | 70.4 329 | 80.7 328 |
7 | 23.2 151 | 16.9 159 | 32.9 397 | 26.7 235 | 68.2 259 | 56.9 302 |
8 | 32.0 199 | 23.4 211 | 74.4 771 | 65.8 325 | 98.2 433 | 114.5 344 |
9 | 35.3 191 | 24.9 189 | 51.0 585 | 51.4 291 | 102.8 361 | 90.9 311 |
Element-wise matrix-matrix addition evaluation results.
Time in milliseconds, Memory in megabytes.
№ | cuBool | clBool | CUSP | cuSPARSE | clSPARSE | SuiteSparse |
---|---|---|---|---|---|---|
M | Time&Mem | Time&Mem | Time&Mem | Time&Mem | Time&Mem | Time&Mem |
0 | 1.1 95 | 1.9 105 | 1.4 105 | 2.4 163 | - - | 2.3 176 |
1 | 1.7 95 | 1.6 109 | 1.0 97 | 0.8 151 | - - | 1.6 174 |
2 | 11.4 221 | 23.8 543 | 16.2 455 | 24.3 405 | - - | 37.2 297 |
3 | 17.5 323 | 35.4 877 | 29.5 723 | 27.2 595 | - - | 64.8 319 |
4 | 24.8 355 | 43.1 989 | 31.9 815 | 89.0 659 | - - | 77.5 318 |
5 | 16.9 189 | 12.5 359 | 11.2 329 | 11.6 317 | - - | 36.6 287 |
6 | 19.6 209 | 15.4 429 | 14.5 385 | 16.9 357 | - - | 45.3 319 |
7 | 19.5 179 | 10.5 321 | 10.2 303 | 10.5 297 | - - | 28.5 302 |
8 | 30.5 259 | 22.4 579 | 19.4 513 | 20.2 447 | - - | 65.2 331 |
9 | 30.1 233 | 18.6 457 | 14.8 423 | 18.3 385 | - - | 50.2 311 |
The clSPARSE-based matrix-matrix multiplication benchmark crashes in Release mode with the error:
terminate called after throwing an instance of 'std::bad_cast'  what():  std::bad_cast
This project is licensed under the MIT license. The full license text can be found in the license file. For detailed license information about the dependencies, refer to the official web pages of these third-party projects.