Add fallback to lower-memory cuSPARSE SpGEMM algorithm #1794
base: develop
Conversation
Thanks for your contribution! Can you also add yourself to the CONTRIBUTORS list (sorted alphabetically, IIRC)? One change request from my side:
We are using a semi-linear merge workflow, so instead of merging the changes from …
Got it, thanks. The test in the file attached to the pull request (spgemm.txt) is the most reliable way I've found, and is basically where the problem came up in the wild, so to speak: multiply two CSR matrices where the result would contain roughly 13M or more nonzeros. In the attached example, this is triggered with ./spgemm cuda 4200000. There are further examples of failure cases in this thread: …
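The attached spgemm.txt is not inlined in this thread, so the following is only a hedged sketch of a reproducer in its spirit, not the attached file: it assembles a 1D forward-difference matrix G and forms a Poisson operator G^T * G through Ginkgo's Csr-times-Csr apply, which dispatches to the cuSPARSE SpGEMM kernels patched here. It assumes a recent Ginkgo release (the shared_ptr-accepting apply of Ginkgo >= 1.6). For n = 4200000 the product has 3n - 2 ≈ 12.6M nonzeros, right around the threshold mentioned above.

```cpp
#include <cstdlib>
#include <iostream>

#include <ginkgo/ginkgo.hpp>

int main(int argc, char* argv[])
{
    using Csr = gko::matrix::Csr<double, int>;
    // Mirrors the reported trigger `./spgemm cuda 4200000`; executor
    // selection from argv[1] is elided here and CUDA is assumed.
    const int n = argc > 2 ? std::atoi(argv[2]) : 4200000;
    auto exec = gko::CudaExecutor::create(0, gko::ReferenceExecutor::create());

    // (n - 1) x n forward-difference matrix: row i holds -1 and +1 at
    // columns i and i + 1.
    gko::matrix_data<double, int> data{gko::dim<2>(n - 1, n)};
    for (int i = 0; i < n - 1; ++i) {
        data.nonzeros.emplace_back(i, i, -1.0);
        data.nonzeros.emplace_back(i, i + 1, 1.0);
    }
    auto g = gko::share(Csr::create(exec));
    g->read(data);

    // Poisson operator L = G^T * G: tridiagonal with 3n - 2 nonzeros, which
    // lands in the size range where the default algorithm's buffer
    // estimation blew up.
    auto gt = gko::share(gko::as<Csr>(g->transpose()));
    auto l = gko::share(Csr::create(exec, gko::dim<2>(n, n)));
    gt->apply(g, l);
    std::cout << "product nnz: " << l->get_num_stored_elements() << "\n";
}
```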
Force-pushed from 1a91597 to ccbdf19.
Address an issue where the default cuSPARSE SpGEMM algorithm estimates an overly large memory buffer for matrices with more than ~4 million rows, causing a memory allocation exception regardless of the actual GPU memory capacity. Since CUDA 12.0, alternative, less memory-intensive SpGEMM algorithms have been available to address the issue. The spgemm and advanced_spgemm CUDA routines now attempt to compute the matrix product using the default CUSPARSE_SPGEMM_ALG1 algorithm and, if that fails, fall back to CUSPARSE_SPGEMM_ALG2. Update the cuSPARSE bindings for the SpGEMM-related functions to take the algorithm as an argument.
Use the returned status to check for CUSPARSE_STATUS_INSUFFICIENT_RESOURCES when falling back to SpGEMM ALG2.
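At the cuSPARSE API level, the fallback amounts to running the SpGEMM phases once with ALG1 and retrying with ALG2 when a phase (or a buffer allocation) reports insufficient resources. Below is a minimal sketch of that pattern, not Ginkgo's actual implementation: it omits the final cusparseSpMatGetSize / cusparseCsrSetPointers / cusparseSpGEMM_copy steps, leaks buffers on failure paths for brevity, and a robust version would recreate the SpGEMM descriptor before retrying.

```cpp
#include <cuda_runtime.h>
#include <cusparse.h>

// Run the work-estimation and compute phases of C = A * B with a given
// algorithm; return the first failing status so the caller can fall back.
cusparseStatus_t spgemm_compute_with(cusparseHandle_t handle,
                                     cusparseSpMatDescr_t A,
                                     cusparseSpMatDescr_t B,
                                     cusparseSpMatDescr_t C,
                                     cusparseSpGEMMDescr_t desc,
                                     cusparseSpGEMMAlg_t alg)
{
    const double alpha = 1.0, beta = 0.0;
    const auto op = CUSPARSE_OPERATION_NON_TRANSPOSE;
    size_t size1 = 0, size2 = 0;
    void *buf1 = nullptr, *buf2 = nullptr;

    auto status = cusparseSpGEMM_workEstimation(handle, op, op, &alpha, A, B,
                                                &beta, C, CUDA_R_64F, alg,
                                                desc, &size1, nullptr);
    if (status != CUSPARSE_STATUS_SUCCESS) return status;
    if (cudaMalloc(&buf1, size1) != cudaSuccess)
        return CUSPARSE_STATUS_INSUFFICIENT_RESOURCES;
    status = cusparseSpGEMM_workEstimation(handle, op, op, &alpha, A, B,
                                           &beta, C, CUDA_R_64F, alg, desc,
                                           &size1, buf1);
    if (status != CUSPARSE_STATUS_SUCCESS) return status;

    if (alg == CUSPARSE_SPGEMM_ALG2 || alg == CUSPARSE_SPGEMM_ALG3) {
        // ALG2/ALG3 add an estimation phase that bounds the compute buffer;
        // the chunk-fraction argument is only meaningful for ALG3.
        size_t size3 = 0;
        void* buf3 = nullptr;
        status = cusparseSpGEMM_estimateMemory(handle, op, op, &alpha, A, B,
                                               &beta, C, CUDA_R_64F, alg,
                                               desc, 0.2f, &size3, nullptr,
                                               &size2);
        if (status != CUSPARSE_STATUS_SUCCESS) return status;
        if (cudaMalloc(&buf3, size3) != cudaSuccess)
            return CUSPARSE_STATUS_INSUFFICIENT_RESOURCES;
        status = cusparseSpGEMM_estimateMemory(handle, op, op, &alpha, A, B,
                                               &beta, C, CUDA_R_64F, alg,
                                               desc, 0.2f, &size3, buf3,
                                               &size2);
        cudaFree(buf3);
        if (status != CUSPARSE_STATUS_SUCCESS) return status;
    } else {
        // With ALG1/DEFAULT, this size query is where the oversized buffer
        // request shows up for large products.
        status = cusparseSpGEMM_compute(handle, op, op, &alpha, A, B, &beta,
                                        C, CUDA_R_64F, alg, desc, &size2,
                                        nullptr);
        if (status != CUSPARSE_STATUS_SUCCESS) return status;
    }
    if (cudaMalloc(&buf2, size2) != cudaSuccess)
        return CUSPARSE_STATUS_INSUFFICIENT_RESOURCES;
    status = cusparseSpGEMM_compute(handle, op, op, &alpha, A, B, &beta, C,
                                    CUDA_R_64F, alg, desc, &size2, buf2);
    cudaFree(buf1);  // buf2 must stay alive until cusparseSpGEMM_copy
    return status;
}

// The fallback itself: try ALG1 first and retry with the lower-memory ALG2
// on CUSPARSE_STATUS_INSUFFICIENT_RESOURCES, as this PR does for the spgemm
// and advanced_spgemm kernels.
cusparseStatus_t spgemm_with_fallback(cusparseHandle_t handle,
                                      cusparseSpMatDescr_t A,
                                      cusparseSpMatDescr_t B,
                                      cusparseSpMatDescr_t C,
                                      cusparseSpGEMMDescr_t desc)
{
    auto status = spgemm_compute_with(handle, A, B, C, desc,
                                      CUSPARSE_SPGEMM_ALG1);
    if (status == CUSPARSE_STATUS_INSUFFICIENT_RESOURCES) {
        status = spgemm_compute_with(handle, A, B, C, desc,
                                     CUSPARSE_SPGEMM_ALG2);
    }
    return status;
}
```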
Force-pushed from ccbdf19 to f4d20f6.
Your example has a slight bug, BTW, which causes the cuSPARSE transpose routine to do something slightly weird: you claim to have …
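The comment above is truncated, but the reply below suggests the declared nonzero count disagreed with what the CSR row pointers described. As a hedged illustration (a hypothetical helper, not code from this PR or the attached example): the nnz handed to cusparseCreateCsr must equal row_ptrs[num_rows], otherwise routines consuming the descriptor, such as the transpose path mentioned here, can read the wrong amount of data.

```cpp
#include <cstdint>

#include <cusparse.h>

// Hypothetical helper: build a CSR descriptor whose declared nnz is taken
// from the row pointers themselves rather than a separately tracked count.
// h_last_row_ptr is the host copy of d_row_ptrs[num_rows].
cusparseSpMatDescr_t make_csr_descriptor(int64_t num_rows, int64_t num_cols,
                                         int* d_row_ptrs, int* d_col_idxs,
                                         double* d_vals, int h_last_row_ptr)
{
    const int64_t nnz = h_last_row_ptr;  // must match the row-pointer array
    cusparseSpMatDescr_t mat;
    cusparseCreateCsr(&mat, num_rows, num_cols, nnz, d_row_ptrs, d_col_idxs,
                      d_vals, CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_64F);
    return mat;
}
```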
Wow, that is a good catch, and it doesn't pop up in the reference executor. With the correct nnz specified, the fallback is never triggered and the multiply works until my card runs out of memory (at 30 million rows or so). I don't think this is related to why I encountered the memory limit in the first place, as the matrices in my application code are all read from … I had to rig up something closer to my actual application to reproduce the error again; here I fill a discrete gradient matrix on an … I get the following output:
…
A simple test .cpp file, spgemm.txt, is attached to show the creation of a Poisson operator by multiplying two matrices. The sample output pre-patch:
…
Post-patch:
…