WIP: SVE intrinsics implementation of CSR SpMV and Merge-SpMV algorithms #1501
This PR provides two implementations of CSR SpMV ("traditional" and Merge-SpMV, from https://github.com/dumerrill/merge-spmv/raw/master/merge-based-spmv-sc16-preprint.pdf ) using SVE intrinsics for double precision. The PR is far from integration-ready and should be considered more of an example of what the implementation could look like. One should eventually also apply the suggestions from PR #1497 about the RHS, integration (`a->get_strategy()`), and OpenMP scheduling. To ease testing, I put the current implementation in place of the OpenMP CSR SpMV, although it should probably live in a completely separate (completely new?) part of Ginkgo.
The motivation for having code with SVE intrinsics is performance. An SVE intrinsics implementation can vectorize significantly better on Arm machines that support SVE (Fujitsu A64FX, Amazon Graviton, Nvidia Grace...), since GCC auto-vectorization of the CSR kernel seems to be poor. We have measured performance improvements of up to 80% for bone010.mtx on Fujitsu A64FX and up to 36% for thermal2.mtx on an Amazon Graviton3 machine when using this implementation with SVE intrinsics.
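For readers unfamiliar with Merge-SpMV, the core of the algorithm in the linked paper is a binary search that splits the combined list of row ends and nonzeros evenly between threads. Below is a minimal scalar sketch of that search, not the code in this PR. The function name `merge_path_search`, the 64-bit index type, and the `std::vector` interface are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper (not from this PR): returns the (row, nonzero)
// coordinate at which the given diagonal crosses the merge path formed by the
// CSR row end offsets (row_ptr[1..num_rows]) and the nonzero indices 0..nnz-1.
std::pair<std::int64_t, std::int64_t> merge_path_search(
    std::int64_t diagonal, const std::vector<std::int64_t>& row_ptr,
    std::int64_t nnz)
{
    const auto num_rows = static_cast<std::int64_t>(row_ptr.size()) - 1;
    assert(num_rows >= 0 && diagonal >= 0 && diagonal <= num_rows + nnz);

    // Range of row coordinates that can lie on this diagonal.
    std::int64_t lo = std::max<std::int64_t>(diagonal - nnz, 0);
    std::int64_t hi = std::min(diagonal, num_rows);

    while (lo < hi) {
        const std::int64_t pivot = lo + (hi - lo) / 2;
        // row_ptr[pivot + 1] is the end offset of row `pivot`. If it does not
        // exceed the nonzero index paired with it on this diagonal, the path
        // has already passed the end of that row.
        if (row_ptr[pivot + 1] <= diagonal - pivot - 1) {
            lo = pivot + 1;
        } else {
            hi = pivot;
        }
    }
    return {std::min(lo, num_rows), diagonal - lo};
}
```

With `num_threads` workers, each one would call this for the diagonals `t * items_per_thread` and `(t + 1) * items_per_thread`, where the total number of items is `num_rows + nnz`, and then process the rows and nonzeros between the two returned coordinates.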
Unlike AVX intrinsics, SVE allows vector-length-agnostic implementations, which leads to cleaner code. The code in the proposed PR works on both A64FX (512-bit vector length) and Graviton3 (256-bit vector length).
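To illustrate what vector-length-agnostic means here, the following is a simplified sketch of a "traditional" CSR row loop written with ACLE SVE intrinsics. It is not the kernel from this PR. It assumes 64-bit column indices (32-bit indices would need a widening load) and uses a hypothetical function name `csr_spmv`. The scalar fallback is only there so the snippet also builds on machines without SVE.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

// Hypothetical sketch: y = A * x for a CSR matrix with double values and
// 64-bit column indices (an assumption made to keep the gather simple).
void csr_spmv(const std::vector<std::int64_t>& row_ptr,
              const std::vector<std::int64_t>& col_idx,
              const std::vector<double>& val, const std::vector<double>& x,
              std::vector<double>& y)
{
    assert(row_ptr.size() == y.size() + 1);
    assert(col_idx.size() == val.size());
    const auto num_rows = static_cast<std::int64_t>(y.size());

    for (std::int64_t row = 0; row < num_rows; ++row) {
        const std::int64_t begin = row_ptr[row];
        const std::int64_t end = row_ptr[row + 1];
#if defined(__ARM_FEATURE_SVE)
        // svcntd() is the number of 64-bit lanes, known only at run time, so
        // the same binary works for 256-bit and 512-bit vectors.
        const auto step = static_cast<std::int64_t>(svcntd());
        svfloat64_t acc = svdup_n_f64(0.0);
        for (std::int64_t k = begin; k < end; k += step) {
            // The predicate switches off lanes past the end of the row, so no
            // separate scalar remainder loop is needed.
            const svbool_t pg = svwhilelt_b64_s64(k, end);
            const svfloat64_t a = svld1_f64(pg, &val[k]);
            const svint64_t idx = svld1_s64(pg, &col_idx[k]);
            const svfloat64_t xv =
                svld1_gather_s64index_f64(pg, x.data(), idx);
            // acc += a * xv on active lanes; inactive lanes keep acc.
            acc = svmla_f64_m(pg, acc, a, xv);
        }
        // Horizontal sum of all lanes gives the dot product of the row.
        y[row] = svaddv_f64(svptrue_b64(), acc);
#else
        // Scalar fallback for targets without SVE.
        double sum = 0.0;
        for (std::int64_t k = begin; k < end; ++k) {
            sum += val[k] * x[col_idx[k]];
        }
        y[row] = sum;
#endif
    }
}
```

The SVE path sums each row in a different order than the scalar loop, so results can differ in the last bits for general floating-point data.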
On the other hand, AFAIK there is no easy way to handle different data types (double, float, complex...), so each one needs a separate intrinsics implementation. The code in the proposed PR works only for double precision.
Finally, note that the OpenMP parallelization is commented out in the code. The reason is a known internal bug in the GCC compiler ( https://gcc.gnu.org/bugzilla//show_bug.cgi?id=101018 ) that sometimes occurs when OpenMP pragmas are combined with SVE intrinsics. I hope that other compilers do not have this issue and that the fix already committed to GCC is upstreamed soon. Once this problem is fixed, one should simply uncomment the OpenMP pragmas in this PR, and the code should run in parallel.