Add batched dgemm to CCSD(T) for openmp offload and openacc implementations #980
Conversation
@omarkahmed could you please rebase your pull request to the current master?
@edoapra , just rebased.
@omarkahmed This comment is about the previous commit 05f5cb5 you made 8 months ago. Why did you put options for the C compiler when the ccsd code does not contain any C source code?
Thanks, that's a good point. This was due to earlier versions, which had some C source. I'm currently sanity-checking that removing this code doesn't impact my unit tests.
I think that the Fortran options might need some cleanup, too.
The "OpenACC" version uses CUDA Fortran and only compiles with NVHPC Fortran. I didn't bother renaming it, but it really should just be the "nvhpc" or "NVIDIA" version. If you added batched CUBLAS, I'll take a look, but the dimensions used in CCSD(T) are known not to leverage batching effectively. The current version with async streams splits the batch of GEMMs, so adding one batched call is more synchronous. It's unlikely to be better on any NVIDIA GPU, which is the only supported hardware for the code as written.
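For readers unfamiliar with the trade-off being discussed, the batched-versus-looped GEMM pattern can be sketched with NumPy as a stand-in for cuBLAS (illustrative only: the tile sizes below are made up, and the real code issues DGEMMs through CUDA Fortran / cuBLAS, e.g. `cublasDgemmStridedBatched` for the batched form versus per-matrix calls on async streams):

```python
# Sketch of batched vs looped GEMM over a stack of small matrices.
# Tile sizes are invented; real CCSD(T) tile sizes depend on basis set and tiling.
import numpy as np

rng = np.random.default_rng(0)
batch, m, k, n = 8, 16, 16, 16
A = rng.standard_normal((batch, m, k))
B = rng.standard_normal((batch, k, n))

# "Batched" form: one call covering the whole stack (cublasDgemmStridedBatched-style).
C_batched = np.matmul(A, B)

# "Looped" form: one GEMM per matrix, as an async-stream version would issue them.
C_looped = np.empty_like(C_batched)
for i in range(batch):
    C_looped[i] = A[i] @ B[i]

# Both produce identical results; only the launch/scheduling strategy differs,
# which is exactly why the choice is a performance question, not a correctness one.
assert np.allclose(C_batched, C_looped)
```

The batched form trades the overlap that independent streams provide for a single synchronous launch, which is the concern raised above for the matrix shapes CCSD(T) produces.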
@edoapra , thanks for the suggestion! I just pushed up a patch to separate out the INCLUDES and DEFINES.
@jeffhammond , yes these are batched CUBLAS calls. Thanks for the additional information and review! I can remove the batched calls if you think that makes more sense.
I'm on vacation but I'll test the batched stuff when I get back. If it's optional, it shouldn't do any harm to be there.
I have just rebased |
@jeffhammond could you please have a look at the commit I have just made to the ccsd USE_OPENACC bit? |
do you have performance numbers for the NVIDIA implementation here? this is a large change and since i will end up maintaining it, i want to understand what the upside is. |
@jeffhammond what environment variables are you using to link when USE_OPENACC_TRPDRV=1? I am getting a bunch of cuda and pgf90/pgi undefined objects.
This code is 6 times slower than it was before. I have no idea why you are contributing this without doing basic testing to determine that it is valuable to NWChem's developers and users.
Without batching:
With batching:
These tests were with 4*H2O with cc-pVTZ on a 4090. The utility of batching goes down as matrix sizes get larger, so 6x is on the low end of the slowdown one would expect from this change.
@jeffhammond @omarkahmed Should we mark this pull request as draft for the time being?
Please revert all changes to the OpenACC.F code. They are not good.
Intel should just leave my version alone and contribute theirs on its own.
@jeffhammond , thanks for the feedback. Will revert the OpenACC.F modifications.
export USE_DEBUG=1
export USE_F90_ALLOCATABLE=y
export USE_OPENACC_TRPDRV=y
export NWCHEM_LINK_CUDA=y
I think the last one is the one that matters, but I’ll check tomorrow. I already turned my workstation off for the day.
I can get the link to work by modifying makefile.h with this change
diff -u config/makefile.h config/makefile.h.linkcuda
--- config/makefile.h 2024-07-30 10:23:16.989751519 -0700
+++ config/makefile.h.linkcuda 2024-07-26 11:54:17.992993104 -0700
@@ -3658,7 +3658,7 @@
ifdef NWCHEM_LINK_CUDA
ifeq ($(_FC),pgf90)
- CORE_LIBS += -acc -cuda -cudalib=cublas
+ CORE_LIBS += -L/usr/local/cuda/targets/x86_64-linux/lib/ -acc -cuda -cudalib=cublas
endif
ifeq ($(_FC),gfortran)
CORE_LIBS += -fopenacc -lcublas
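For diagnosing undefined cuda/cublas symbols like the ones reported above, a quick inspection of the linked binary can confirm whether the CUDA libraries were actually pulled in (a sketch only; the `./nwchem` binary path is an assumption, adjust for your build tree):

```shell
# Check whether CUDA/cuBLAS shared libraries were linked in ("./nwchem" is an assumed path).
ldd ./nwchem 2>/dev/null | grep -iE 'cublas|cudart' || echo "no CUDA libs linked"
# Look for still-undefined cublas references in the dynamic symbol table.
nm -D ./nwchem 2>/dev/null | grep -i ' U cublas' || echo "no undefined cublas symbols"
```

If the first command prints nothing from the real library list, the `-cudalib=cublas` (or the extra `-L` path in the diff above) never reached the final link line.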
My link problem vanished after installing a more recent NVHPC release.
@jeffhammond , I reorganized the commits to eliminate the changes to the cuda version. Please let me know if you have any other requests.
Thanks. This is good for me.
This is an implementation of Batched DGEMM for CCSD(T) for OpenMP Offload and OpenACC implementations.
Acknowledgements include:
Nawal Copty
Rakesh Krishnaiyer
Abhinav Gaba
Ravi Narayanaswamy
Nitin Gawande