
Add fallback to lower-memory cuSparse SpGEMM algorithm #1794

Open
izilberter wants to merge 3 commits into develop
Conversation

@izilberter (Author)

Address an issue where the default cuSPARSE SpGEMM algorithm estimates an overly large memory buffer for matrices with more than ~4 million rows, causing a memory allocation exception regardless of the actual GPU memory capacity. Since CUDA 12.0, cuSPARSE has provided alternative, less memory-intensive SpGEMM algorithms that avoid the issue.

The spgemm and advanced_spgemm CUDA routines now attempt to compute the matrix product using the default CUSPARSE_SPGEMM_ALG1 algorithm and, if that fails, fall back to CUSPARSE_SPGEMM_ALG2. The cuSPARSE bindings for the spgemm-related functions are updated to take the algorithm as an argument.
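For illustration, the fallback logic looks roughly like the sketch below, which follows the standard cuSPARSE generic SpGEMM workflow (work estimation, then compute). This is an illustrative reconstruction rather than the patch itself: the helpers try_spgemm_compute and spgemm_compute_with_fallback are made-up names, double/CUDA_R_64F is an assumed value type, and mapping cudaMalloc failures onto the same fallback path is a choice of the sketch.

```cpp
#include <cuda_runtime.h>
#include <cusparse.h>

// Run work estimation + compute for one algorithm choice and return the
// first failing status, so the caller can decide whether to retry.
cusparseStatus_t try_spgemm_compute(cusparseHandle_t handle,
                                    cusparseSpMatDescr_t A,
                                    cusparseSpMatDescr_t B,
                                    cusparseSpMatDescr_t C,
                                    cusparseSpGEMMDescr_t descr,
                                    cusparseSpGEMMAlg_t alg,
                                    void** buffer1, void** buffer2)
{
    const double alpha = 1.0;
    const double beta = 0.0;
    const auto op = CUSPARSE_OPERATION_NON_TRANSPOSE;
    size_t size1 = 0;
    size_t size2 = 0;

    auto status = cusparseSpGEMM_workEstimation(
        handle, op, op, &alpha, A, B, &beta, C, CUDA_R_64F, alg, descr,
        &size1, nullptr);
    if (status != CUSPARSE_STATUS_SUCCESS) return status;
    if (cudaMalloc(buffer1, size1) != cudaSuccess)
        return CUSPARSE_STATUS_INSUFFICIENT_RESOURCES;
    status = cusparseSpGEMM_workEstimation(
        handle, op, op, &alpha, A, B, &beta, C, CUDA_R_64F, alg, descr,
        &size1, *buffer1);
    if (status != CUSPARSE_STATUS_SUCCESS) return status;

    // For large inputs, ALG1's buffer-size query here can demand far more
    // memory than any GPU has, or fail outright.
    status = cusparseSpGEMM_compute(
        handle, op, op, &alpha, A, B, &beta, C, CUDA_R_64F, alg, descr,
        &size2, nullptr);
    if (status != CUSPARSE_STATUS_SUCCESS) return status;
    if (cudaMalloc(buffer2, size2) != cudaSuccess)
        return CUSPARSE_STATUS_INSUFFICIENT_RESOURCES;
    // On success, the caller still sizes C (cusparseSpMatGetSize), sets its
    // pointers (cusparseCsrSetPointers), and calls cusparseSpGEMM_copy
    // before freeing buffer1/buffer2.
    return cusparseSpGEMM_compute(
        handle, op, op, &alpha, A, B, &beta, C, CUDA_R_64F, alg, descr,
        &size2, *buffer2);
}

// Try the default ALG1 first; on CUSPARSE_STATUS_INSUFFICIENT_RESOURCES,
// retry with the lower-memory ALG2 introduced in CUDA 12.0.
cusparseStatus_t spgemm_compute_with_fallback(cusparseHandle_t handle,
                                              cusparseSpMatDescr_t A,
                                              cusparseSpMatDescr_t B,
                                              cusparseSpMatDescr_t C,
                                              cusparseSpGEMMDescr_t* descr,
                                              void** buffer1, void** buffer2)
{
    auto status = try_spgemm_compute(handle, A, B, C, *descr,
                                     CUSPARSE_SPGEMM_ALG1, buffer1, buffer2);
    if (status == CUSPARSE_STATUS_INSUFFICIENT_RESOURCES) {
        // Release partial allocations and reset the SpGEMM descriptor,
        // which may carry state from the failed ALG1 attempt.
        cudaFree(*buffer1);
        cudaFree(*buffer2);
        *buffer1 = *buffer2 = nullptr;
        cusparseSpGEMM_destroyDescr(*descr);
        cusparseSpGEMM_createDescr(descr);
        status = try_spgemm_compute(handle, A, B, C, *descr,
                                    CUSPARSE_SPGEMM_ALG2, buffer1, buffer2);
    }
    return status;
}
```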

A simple test program (attached as spgemm.txt) demonstrates the creation of a Poisson operator by multiplying two matrices. Sample output pre-patch:

./spgemm cuda 4194305
terminate called after throwing an instance of 'gko::CusparseError'
what(): ginkgo/cuda/base/cusparse_bindings.hpp:228: spgemm_compute: Unknown error

Post patch:

./spgemm cuda 4194305
Matrix multiply time(s): 0.0102431

@upsj (Member) left a comment

Thanks for your contribution! Can you also add yourself to the CONTRIBUTORS list (sorted alphabetically, IIRC)? One change request from my side.

@upsj (Member) commented Feb 20, 2025

We are using a semi-linear merge workflow, so instead of merging the changes from develop into this branch, please rebase onto develop so your branch is just a linear sequence of commits on top of develop.
Did you encounter an easy way to trigger this error, so we can test the fallback execution path?

@izilberter (Author)

> We are using a semi-linear merge workflow, so instead of merging the changes from develop into this branch, please rebase onto develop so your branch is just a linear sequence of commits on top of develop. Did you encounter an easy way to trigger this error, so we can test the fallback execution path?

Got it, thanks.

The test in the file attached to the pull request (spgemm.txt) is the most reliable way I've found, and is basically where the problem came up in the wild, so to speak: multiply two CSR matrices where the result would contain roughly 13M or more nonzeros. In the attached example, this is triggered with ./spgemm cuda 4200000. There are further examples of failure cases in this thread:
NVIDIA/CUDALibrarySamples#38

The updated bindings are also used to check specifically for CUSPARSE_STATUS_INSUFFICIENT_RESOURCES when falling back to spgemm ALG2.
@upsj (Member) commented Feb 21, 2025

Your example has a slight bug, BTW, which causes the cuSPARSE transpose routine to do something slightly weird: you claim to have 2 * discretization_points non-zeros but only fill 2 * discretization_points - 1 of them, so the resulting transposed matrix has an additional entry (1, 4200000) that doesn't belong.
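For illustration, a hypothetical reduction of such a bug (not the attached file's actual code): a CSR matrix whose final row pointer claims 2 * n entries while the fill loop only ever writes 2 * n - 1.

```cpp
#include <vector>

// Hypothetical sketch, not the attached spgemm.txt: build a CSR matrix with
// diagonal 2 and superdiagonal -1, over-declaring the nonzero count.
void build_bidiagonal(int n, std::vector<int>& row_ptrs,
                      std::vector<int>& cols, std::vector<double>& vals)
{
    row_ptrs.assign(n + 1, 0);
    cols.resize(2 * n);  // sized for the claimed nnz of 2 * n
    vals.resize(2 * n);
    int idx = 0;
    for (int i = 0; i < n; ++i) {
        row_ptrs[i] = idx;
        cols[idx] = i;
        vals[idx] = 2.0;
        ++idx;
        if (i + 1 < n) {  // the superdiagonal is absent in the last row,
            cols[idx] = i + 1;
            vals[idx] = -1.0;
            ++idx;        // so only 2 * n - 1 entries are ever written
        }
    }
    row_ptrs[n] = 2 * n;  // BUG: should be idx (2 * n - 1). The unwritten
                          // slot still counts as part of the last row; it is
                          // value-initialized to (col 0, val 0) here, and
                          // arbitrary with uninitialized device memory, so
                          // the transpose emits a stray entry.
}
```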

@izilberter (Author) commented Feb 21, 2025

> Your example has a slight bug, BTW, which causes the cuSPARSE transpose routine to do something slightly weird: you claim to have 2 * discretization_points non-zeros but only fill 2 * discretization_points - 1 of them, so the resulting transposed matrix has an additional entry (1, 4200000) that doesn't belong.

Wow, that is a good catch, and one that doesn't pop up on the reference executor. With the correct nnz specified, the fallback is never triggered and the multiply works until my card runs out of memory (at roughly 30 million rows). I don't think this is related to why I encountered the memory limit in the first place, since the matrices in my application code are all read from matrix_data and I don't use a transpose, but it is good to be aware of.

I had to rig up something closer to my actual application to reproduce the error again: here I fill a discrete gradient matrix on an NX*NX grid, then compute grad * grad^T (NB: not grad^T * grad, which would produce a standard Laplacian).
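Roughly, the construction looks like the following sketch (an illustrative reconstruction, not the attached spgemm.txt; it assumes Ginkgo's public matrix_data/Csr API and that a Csr-times-Csr apply dispatches to the spgemm kernel):

```cpp
#include <cstdlib>
#include <ginkgo/ginkgo.hpp>

int main(int argc, char* argv[])
{
    using Csr = gko::matrix::Csr<double, int>;
    const int nx = argc > 1 ? std::atoi(argv[1]) : 2500;
    const auto n_cells = static_cast<gko::size_type>(nx) * nx;
    const auto n_edges = 2 * static_cast<gko::size_type>(nx) * (nx - 1);

    auto exec = gko::CudaExecutor::create(0, gko::ReferenceExecutor::create());

    // One row per grid edge: -1 on the source cell, +1 on the target cell.
    gko::matrix_data<double, int> data{gko::dim<2>{n_edges, n_cells}};
    int row = 0;
    for (int j = 0; j < nx; ++j) {      // d/dx edges
        for (int i = 0; i + 1 < nx; ++i) {
            data.nonzeros.emplace_back(row, j * nx + i, -1.0);
            data.nonzeros.emplace_back(row, j * nx + i + 1, 1.0);
            ++row;
        }
    }
    for (int j = 0; j + 1 < nx; ++j) {  // d/dy edges
        for (int i = 0; i < nx; ++i) {
            data.nonzeros.emplace_back(row, j * nx + i, -1.0);
            data.nonzeros.emplace_back(row, (j + 1) * nx + i, 1.0);
            ++row;
        }
    }

    auto grad = Csr::create(exec);
    grad->read(data);
    auto grad_t = gko::as<Csr>(grad->transpose());
    // Csr x Csr apply dispatches to the executor's spgemm kernel; the
    // product is n_edges x n_edges, ~12.5M rows for nx = 2500.
    auto product = Csr::create(exec, gko::dim<2>{n_edges, n_edges});
    grad->apply(grad_t, product);
}
```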

I get the following output:

./spgemm cuda 2000
Matrix multiply time(s): 0.054936

./spgemm cuda 2500
ginkgo/cuda/base/cusparse_bindings.hpp:247: spgemm_compute: CUSPARSE_STATUS_INSUFFICIENT_RESOURCES,
Falling back to Alg2
Matrix multiply time(s): 0.129386

spgemm.txt
