
Add fallback to lower-memory cuSparse SpGEMM algorithm #1794

Open
izilberter wants to merge 3 commits into develop
Conversation

@izilberter (Author)

Address an issue where the default cuSPARSE SpGEMM algorithm estimates an overly large memory buffer for matrices with more than ~4 million rows, causing a memory allocation exception regardless of the actual GPU memory capacity. Since CUDA 12.0, cuSPARSE has provided alternative, less memory-intensive SpGEMM algorithms that avoid the issue.

The spgemm and advanced_spgemm CUDA routines now attempt to compute the matrix product using the default CUSPARSE_SPGEMM_ALG1 algorithm and, if that fails, fall back to CUSPARSE_SPGEMM_ALG2. The cuSPARSE bindings for the spgemm-related functions are updated to take the algorithm as an argument.
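For illustration, the fallback logic looks roughly like the sketch below, which follows the standard cuSPARSE generic SpGEMM workflow (work estimation, then compute). This is an illustrative reconstruction rather than the patch itself: the helpers try_spgemm_compute and spgemm_compute_with_fallback are made-up names, double/CUDA_R_64F is an assumed value type, and mapping cudaMalloc failures onto the same fallback path is a choice of the sketch.

```cpp
#include <cuda_runtime.h>
#include <cusparse.h>

// Run work estimation + compute for one algorithm choice and return the
// first failing status, so the caller can decide whether to retry.
cusparseStatus_t try_spgemm_compute(cusparseHandle_t handle,
                                    cusparseSpMatDescr_t A,
                                    cusparseSpMatDescr_t B,
                                    cusparseSpMatDescr_t C,
                                    cusparseSpGEMMDescr_t descr,
                                    cusparseSpGEMMAlg_t alg,
                                    void** buffer1, void** buffer2)
{
    const double alpha = 1.0;
    const double beta = 0.0;
    const auto op = CUSPARSE_OPERATION_NON_TRANSPOSE;
    size_t size1 = 0;
    size_t size2 = 0;

    auto status = cusparseSpGEMM_workEstimation(
        handle, op, op, &alpha, A, B, &beta, C, CUDA_R_64F, alg, descr,
        &size1, nullptr);
    if (status != CUSPARSE_STATUS_SUCCESS) return status;
    if (cudaMalloc(buffer1, size1) != cudaSuccess)
        return CUSPARSE_STATUS_INSUFFICIENT_RESOURCES;
    status = cusparseSpGEMM_workEstimation(
        handle, op, op, &alpha, A, B, &beta, C, CUDA_R_64F, alg, descr,
        &size1, *buffer1);
    if (status != CUSPARSE_STATUS_SUCCESS) return status;

    // For large inputs, ALG1's buffer-size query here can demand far more
    // memory than any GPU has, or fail outright.
    status = cusparseSpGEMM_compute(
        handle, op, op, &alpha, A, B, &beta, C, CUDA_R_64F, alg, descr,
        &size2, nullptr);
    if (status != CUSPARSE_STATUS_SUCCESS) return status;
    if (cudaMalloc(buffer2, size2) != cudaSuccess)
        return CUSPARSE_STATUS_INSUFFICIENT_RESOURCES;
    // On success, the caller still sizes C (cusparseSpMatGetSize), sets its
    // pointers (cusparseCsrSetPointers), and calls cusparseSpGEMM_copy
    // before freeing buffer1/buffer2.
    return cusparseSpGEMM_compute(
        handle, op, op, &alpha, A, B, &beta, C, CUDA_R_64F, alg, descr,
        &size2, *buffer2);
}

// Try the default ALG1 first; on CUSPARSE_STATUS_INSUFFICIENT_RESOURCES,
// retry with the lower-memory ALG2 introduced in CUDA 12.0.
cusparseStatus_t spgemm_compute_with_fallback(cusparseHandle_t handle,
                                              cusparseSpMatDescr_t A,
                                              cusparseSpMatDescr_t B,
                                              cusparseSpMatDescr_t C,
                                              cusparseSpGEMMDescr_t* descr,
                                              void** buffer1, void** buffer2)
{
    auto status = try_spgemm_compute(handle, A, B, C, *descr,
                                     CUSPARSE_SPGEMM_ALG1, buffer1, buffer2);
    if (status == CUSPARSE_STATUS_INSUFFICIENT_RESOURCES) {
        // Release partial allocations and reset the SpGEMM descriptor,
        // which may carry state from the failed ALG1 attempt.
        cudaFree(*buffer1);
        cudaFree(*buffer2);
        *buffer1 = *buffer2 = nullptr;
        cusparseSpGEMM_destroyDescr(*descr);
        cusparseSpGEMM_createDescr(descr);
        status = try_spgemm_compute(handle, A, B, C, *descr,
                                    CUSPARSE_SPGEMM_ALG2, buffer1, buffer2);
    }
    return status;
}
```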

A simple test program (attached as spgemm.txt) demonstrates the creation of a Poisson operator by multiplying two matrices. Sample output pre-patch:

./spgemm cuda 4194305
terminate called after throwing an instance of 'gko::CusparseError'
what(): ginkgo/cuda/base/cusparse_bindings.hpp:228: spgemm_compute: Unknown error

Post patch:

./spgemm cuda 4194305
Matrix multiply time(s): 0.0102431

@upsj (Member) left a comment

Thanks for your contribution! Can you also add yourself to the CONTRIBUTORS list (sorted alphabetically, IIRC)? One change request from my side.

@upsj (Member) commented Feb 20, 2025

We are using a semi-linear merge workflow, so instead of merging the changes from develop into this branch, please rebase onto develop so your branch is just a linear sequence of commits on top of develop.
Did you encounter an easy way to trigger this error, so we can test the fallback execution path?

@izilberter (Author)

> We are using a semi-linear merge workflow, so instead of merging the changes from develop into this branch, please rebase onto develop so your branch is just a linear sequence of commits on top of develop. Did you encounter an easy way to trigger this error, so we can test the fallback execution path?

Got it, thanks.

The test in the file attached to the pull request (spgemm.txt) is the most reliable way I've found, and is basically where the problem came up in the wild, so to speak: multiply two CSR matrices where the result would contain roughly 13M or more nonzeros. In the attached example, this is triggered with ./spgemm cuda 4200000. There are further examples of failure cases in this thread:
NVIDIA/CUDALibrarySamples#38

The updated bindings are also used to check specifically for CUSPARSE_STATUS_INSUFFICIENT_RESOURCES when falling back to spgemm ALG2.
@upsj (Member) commented Feb 21, 2025

Your example has a slight bug, BTW, which causes the cuSPARSE transpose routine to do something slightly weird: you claim to have 2 * discretization_points non-zeros but only fill 2 * discretization_points - 1 of them, so the resulting transposed matrix has an additional entry (1, 4200000) that doesn't belong.
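For illustration, a hypothetical reduction of such a bug (not the attached file's actual code): a CSR matrix whose final row pointer claims 2 * n entries while the fill loop only ever writes 2 * n - 1.

```cpp
#include <vector>

// Hypothetical sketch, not the attached spgemm.txt: build a CSR matrix with
// diagonal 2 and superdiagonal -1, over-declaring the nonzero count.
void build_bidiagonal(int n, std::vector<int>& row_ptrs,
                      std::vector<int>& cols, std::vector<double>& vals)
{
    row_ptrs.assign(n + 1, 0);
    cols.resize(2 * n);  // sized for the claimed nnz of 2 * n
    vals.resize(2 * n);
    int idx = 0;
    for (int i = 0; i < n; ++i) {
        row_ptrs[i] = idx;
        cols[idx] = i;
        vals[idx] = 2.0;
        ++idx;
        if (i + 1 < n) {  // the superdiagonal is absent in the last row,
            cols[idx] = i + 1;
            vals[idx] = -1.0;
            ++idx;        // so only 2 * n - 1 entries are ever written
        }
    }
    row_ptrs[n] = 2 * n;  // BUG: should be idx (2 * n - 1). The unwritten
                          // slot still counts as part of the last row; it is
                          // value-initialized to (col 0, val 0) here, and
                          // arbitrary with uninitialized device memory, so
                          // the transpose emits a stray entry.
}
```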

@izilberter (Author) commented Feb 21, 2025

> Your example has a slight bug, BTW, which causes the cuSPARSE transpose routine to do something slightly weird: you claim to have 2 * discretization_points non-zeros but only fill 2 * discretization_points - 1 of them, so the resulting transposed matrix has an additional entry (1, 4200000) that doesn't belong.

Wow, that is a good catch, and one that doesn't pop up on the reference executor. With the correct nnz specified, the fallback is never triggered and the multiply works until my card runs out of memory (at roughly 30 million rows). I don't think this is related to why I encountered the memory limit in the first place, since the matrices in my application code are all read from matrix_data and I don't use a transpose, but it is good to be aware of.

I had to rig up something closer to my actual application to reproduce the error again: here I fill a discrete gradient matrix on an NX*NX grid, then compute grad * grad^T (NB: not grad^T * grad, which would produce a standard Laplacian).
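Roughly, the construction looks like the following sketch (an illustrative reconstruction, not the attached spgemm.txt; it assumes Ginkgo's public matrix_data/Csr API and that a Csr-times-Csr apply dispatches to the spgemm kernel):

```cpp
#include <cstdlib>
#include <ginkgo/ginkgo.hpp>

int main(int argc, char* argv[])
{
    using Csr = gko::matrix::Csr<double, int>;
    const int nx = argc > 1 ? std::atoi(argv[1]) : 2500;
    const auto n_cells = static_cast<gko::size_type>(nx) * nx;
    const auto n_edges = 2 * static_cast<gko::size_type>(nx) * (nx - 1);

    auto exec = gko::CudaExecutor::create(0, gko::ReferenceExecutor::create());

    // One row per grid edge: -1 on the source cell, +1 on the target cell.
    gko::matrix_data<double, int> data{gko::dim<2>{n_edges, n_cells}};
    int row = 0;
    for (int j = 0; j < nx; ++j) {      // d/dx edges
        for (int i = 0; i + 1 < nx; ++i) {
            data.nonzeros.emplace_back(row, j * nx + i, -1.0);
            data.nonzeros.emplace_back(row, j * nx + i + 1, 1.0);
            ++row;
        }
    }
    for (int j = 0; j + 1 < nx; ++j) {  // d/dy edges
        for (int i = 0; i < nx; ++i) {
            data.nonzeros.emplace_back(row, j * nx + i, -1.0);
            data.nonzeros.emplace_back(row, (j + 1) * nx + i, 1.0);
            ++row;
        }
    }

    auto grad = Csr::create(exec);
    grad->read(data);
    auto grad_t = gko::as<Csr>(grad->transpose());
    // Csr x Csr apply dispatches to the executor's spgemm kernel; the
    // product is n_edges x n_edges, ~12.5M rows for nx = 2500.
    auto product = Csr::create(exec, gko::dim<2>{n_edges, n_edges});
    grad->apply(grad_t, product);
}
```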

I get the following output:

./spgemm cuda 2000
Matrix multiply time(s): 0.054936

./spgemm cuda 2500
ginkgo/cuda/base/cusparse_bindings.hpp:247: spgemm_compute: CUSPARSE_STATUS_INSUFFICIENT_RESOURCES,
Falling back to Alg2
Matrix multiply time(s): 0.129386

spgemm.txt
