
Add batched DGEMM to CCSD(T) for OpenMP offload and OpenACC implementations #980

Conversation

@omarkahmed
Contributor

This adds batched DGEMM to the CCSD(T) OpenMP offload and OpenACC implementations.

Acknowledgements include:
Nawal Copty
Rakesh Krishnaiyer
Abhinav Gaba
Ravi Narayanaswamy
Nitin Gawande
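
For context, "batched DGEMM" here means collapsing many small, independent matrix multiplications into a single library call. Below is a minimal C sketch of the idea against cuBLAS; the PR's actual changes are Fortran with OpenMP offload and OpenACC, and the helper name is hypothetical.

```c
#include <cublas_v2.h>

/* Hypothetical helper illustrating the batched-DGEMM idea (the PR's
 * real code is Fortran).  A, B, C are device buffers holding `batch`
 * panels stored back to back. */
void batched_dgemm_sketch(cublasHandle_t h, int m, int n, int k,
                          const double *A, const double *B, double *C,
                          int batch)
{
    const double alpha = 1.0, beta = 0.0;
    /* One strided-batched call replaces a loop of `batch` cublasDgemm calls. */
    cublasDgemmStridedBatched(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                              &alpha,
                              A, m, (long long)m * k,  /* stride between A panels */
                              B, k, (long long)k * n,  /* stride between B panels */
                              &beta,
                              C, m, (long long)m * n,  /* stride between C panels */
                              batch);
}
```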

@edoapra
Collaborator

edoapra commented Jul 8, 2024

@omarkahmed could you please rebase your pull request onto the current master?

omarkahmed force-pushed the omarkahmed/ccsd_offload_batched_gemm_rebase branch from c66b913 to c697434 on July 9, 2024 00:00
@omarkahmed
Contributor Author

@edoapra, just rebased.

@edoapra
Collaborator

edoapra commented Jul 9, 2024

@omarkahmed This comment is about the previous commit 05f5cb5 you made 8 months ago.

Why did you put options for the C compiler when the ccsd directory does not contain any C source code?
https://github.com/nwchemgit/nwchem/blame/80cff689dbd18943911ee13cc7f4b8a305f52445/src/ccsd/GNUmakefile#L133-L137

@omarkahmed
Contributor Author

> @omarkahmed This comment is about the previous commit 05f5cb5 you made 8 months ago.
>
> Why did you put options for the C compiler when the ccsd directory does not contain any C source code? https://github.com/nwchemgit/nwchem/blame/80cff689dbd18943911ee13cc7f4b8a305f52445/src/ccsd/GNUmakefile#L133-L137

Thanks, that's a good point. This was due to earlier versions, which had some C source. I'm currently sanity-checking that removing this code doesn't impact my unit tests.

@edoapra
Collaborator

edoapra commented Jul 9, 2024

I think that the Fortran options might need some cleanup, too.
For example, I see some MKL-related bits. Are those needed at compile time or link time?

@jeffhammond
Collaborator

The "OpenACC" version uses CUDA Fortran and only compiles with NVHPC Fortran. I didn't bother renaming it but it really should just be the "nvhpc" or "Nvidia" version.

If you added batched CUBLAS, I'll take a look, but the dimensions used in CCSD(T) are known not to leverage batching effectively. The current version splits the batch of GEMMs across async streams, so collapsing them into one batched call makes execution more synchronous. It's unlikely to be better on any NVIDIA GPU, which is the only hardware supported by the code as written.
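
For context, the stream-split pattern described above looks roughly like this in cuBLAS terms (a minimal C sketch with hypothetical names; the real code is Fortran in src/ccsd/). Independent GEMMs are round-robined over streams so they can overlap, whereas a single batched call would be one synchronization point.

```c
#include <cublas_v2.h>
#include <cuda_runtime.h>

/* Hypothetical sketch: issue `ngemms` independent DGEMMs, round-robined
 * over `nstreams` CUDA streams so they can execute concurrently. */
void per_stream_gemms(cublasHandle_t h, cudaStream_t *streams, int nstreams,
                      int ngemms, int m, int n, int k,
                      const double *const *A, const double *const *B,
                      double *const *C)
{
    const double alpha = 1.0, beta = 0.0;
    for (int i = 0; i < ngemms; ++i) {
        cublasSetStream(h, streams[i % nstreams]);  /* pick a stream */
        cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                    &alpha, A[i], m, B[i], k, &beta, C[i], m);
    }
    /* The host returns immediately; the GEMMs complete asynchronously. */
}
```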

@omarkahmed
Contributor Author

> I think that the Fortran options might need some cleanup, too. For example, I see some MKL-related bits. Are those needed at compile time or link time?

@edoapra, thanks for the suggestion! I just pushed a patch to separate out the INCLUDES and DEFINES.

The "OpenACC" version uses CUDA Fortran and only compiles with NVHPC Fortran. I didn't bother renaming it but it really should just be the "nvhpc" or "Nvidia" version.

If you added batched CUBLAS, I'll take a look, but the dimensions used in CCSD(T) are known to not leverage batching effectively. The current version with async streams splits the batch of GEMMs so adding one batched call is more synchronous. It's unlikely to be better on any NVIDIA GPU, which is the only supported hardware for the code as written.

@jeffhammond, yes, these are batched CUBLAS calls. Thanks for the additional information and review! I can remove the batched calls if you think that makes more sense.

@jeffhammond
Collaborator

I'm on vacation but I'll test the batched stuff when I get back. If it's optional, it shouldn't do any harm to be there.

@edoapra
Collaborator

edoapra commented Jul 26, 2024

I have just rebased.

@edoapra
Collaborator

edoapra commented Jul 26, 2024

@jeffhammond could you please have a look at the commit I have just made to the ccsd USE_OPENACC bit?
3052284

edoapra requested a review from jeffhammond on July 29, 2024 16:35
@jeffhammond
Collaborator

Do you have performance numbers for the NVIDIA implementation here? This is a large change, and since I will end up maintaining it, I want to understand what the upside is.

@edoapra
Collaborator

edoapra commented Jul 30, 2024

@jeffhammond what environment variables do you use to link when USE_OPENACC_TRPDRV=1?

I am getting a bunch of CUDA and pgf90/PGI undefined symbols.
I can get the link to work by modifying makefile.h with this change:

diff -u config/makefile.h config/makefile.h.linkcuda 
--- config/makefile.h	2024-07-30 10:23:16.989751519 -0700
+++ config/makefile.h.linkcuda	2024-07-26 11:54:17.992993104 -0700
@@ -3658,7 +3658,7 @@
 
 ifdef NWCHEM_LINK_CUDA
     ifeq ($(_FC),pgf90)
-       CORE_LIBS += -acc -cuda -cudalib=cublas
+       CORE_LIBS += -L/usr/local/cuda/targets/x86_64-linux/lib/ -acc -cuda -cudalib=cublas
     endif
     ifeq ($(_FC),gfortran)
        CORE_LIBS +=  -fopenacc -lcublas

@jeffhammond
Collaborator

This code is 6 times slower than it was before. I have no idea why you are contributing this without doing basic testing to determine that it is valuable to NWChem's developers and users.

Without batching:

 ccsd(t): 100% done, Aggregate Gflops=  274.3     in      285.1 secs
CU+MEM free took     0.24475E-01 seconds
 Time for integral evaluation pass     1       10.38
 Time for triples evaluation pass      1      285.17

With batching:

 ccsd(t): 100% done, Aggregate Gflops=  46.21     in     1692.8 secs
CU+MEM free took     0.24896E-01 seconds
 Time for integral evaluation pass     1       10.22
 Time for triples evaluation pass      1     1692.86

These tests were with 4*H2O with cc-pVTZ on a 4090 (1692.8 s vs 285.1 s, roughly 5.9x). The utility of batching goes down as matrix sizes get larger, so 6x is on the low end of the slowdown one would expect from this change.

@edoapra
Collaborator

edoapra commented Jul 30, 2024

@jeffhammond @omarkahmed Should we mark this pull request as draft for the time being?

@jeffhammond
Collaborator

Please revert all changes to the OpenACC.F code. They are not good.

@jeffhammond
Collaborator

Intel should just leave my version alone and contribute theirs on its own.

@omarkahmed
Contributor Author

@jeffhammond, thanks for the feedback. I will revert the OpenACC.F modifications.

@jeffhammond
Collaborator

jeffhammond commented Jul 30, 2024 via email

@edoapra
Collaborator

edoapra commented Jul 30, 2024

export USE_DEBUG=1
export USE_F90_ALLOCATABLE=y
export USE_OPENACC_TRPDRV=y
export NWCHEM_LINK_CUDA=y

My link problem vanished after installing a more recent NVHPC release.

edoapra force-pushed the omarkahmed/ccsd_offload_batched_gemm_rebase branch from 1d079a7 to 2dc63c1 on July 31, 2024 00:20
omarkahmed force-pushed the omarkahmed/ccsd_offload_batched_gemm_rebase branch from 2dc63c1 to e7a8e04 on August 5, 2024 23:32
@omarkahmed
Contributor Author

> Please revert all changes to the OpenACC.F code. They are not good.

@jeffhammond, I reorganized the commits to eliminate the changes to the CUDA version. Please let me know if you have any other requests.

@jeffhammond
Collaborator

Thanks. This is good for me.

edoapra merged commit b2bd5c6 into nwchemgit:master on Aug 7, 2024. 62 checks passed.