CCSD_T2_7 GPU support #1029

jeffhammond · 2024-10-16T09:33:46Z

This runs the T2_7 terms on the GPU. TCE_SORT uses DO CONCURRENT, so it is strictly GPU agnostic at the source level, but it compiles to GPU code with the NVIDIA Fortran compiler. The DGEMM calls use the automatic CUBLAS offload and compile for CPU as well.
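As a rough illustration of what an index-permuting sort of this kind does, here is a minimal Python sketch. It is not the NWChem Fortran source: the names `sort_4` and `flat`, the argument layout, and the column-major flattening are assumptions made for the example. The point it shows is that every iteration writes a distinct output element and only reads the input, which is the independence property DO CONCURRENT expresses.

```python
from itertools import product


def flat(idx, dims):
    """Column-major offset of a multi-index (first index varies fastest)."""
    off = 0
    for i, n in zip(reversed(idx), reversed(dims)):
        off = off * n + i
    return off


def sort_4(unsorted, dims, perm, factor):
    """Return a permuted, scaled copy of a flat 4-index array (hypothetical helper).

    dims : extents of the four input axes
    perm : perm[n] is the input axis that supplies output axis n
    """
    out_dims = [dims[p] for p in perm]
    out = [0.0] * len(unsorted)
    # Each iteration reads one input element and writes one distinct output
    # element, so the iterations can run in any order or all at once.
    for idx in product(*(range(n) for n in dims)):
        src = flat(idx, dims)
        dst = flat([idx[p] for p in perm], out_dims)
        out[dst] = factor * unsorted[src]
    return out


# Swapping the first two axes of a 2x3x1x1 array is a plain matrix transpose.
print(sort_4([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], (2, 3, 1, 1), (1, 0, 2, 3), 1.0))
```

Because no iteration depends on another, the same loop nest can be compiled for CPU or GPU without source changes, which is what "GPU agnostic at the source level" refers to.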

This would benefit from asynchrony, but I'll do that later. Similarly, this will benefit ICSD/NTS as well, but that's a future PR.

Performance, again on an AMD 7950X and RTX 4090, for (H2O)4 with cc-pVQZ, using 16 MPI processes (ARMCI-MPI):

CPU

CCSD iterations
 -----------------------------------------------------------------
 Iter          Residuum       Correlation     Cpu    Wall    V2*C2
 -----------------------------------------------------------------
    1   0.2963584119497  -1.1503632882976   228.5   228.5   133.4
           ccsd_t2_1:8+c2f      0.095   13.579    6.707    0.667    9.497    0.936   51.310  133.359    0.472
    2   0.0616536970547  -1.1400056629788   227.6   227.6   132.8
           ccsd_t2_1:8+c2f      0.088   13.458    6.744    0.655    9.407    0.913   51.252  132.828    0.456
    3   0.0200706083152  -1.1564661687147   227.6   227.6   132.8
           ccsd_t2_1:8+c2f      0.086   13.464    6.765    0.649    9.403    0.914   51.325  132.844    0.453
    4   0.0077536091138  -1.1573100784705   228.4   228.4   132.9
           ccsd_t2_1:8+c2f      0.089   13.588    6.744    0.656    9.801    0.905   51.511  132.940    0.460
 -----------------------------------------------------------------
 Iterations converged
 CCSD correlation energy / hartree =        -1.157310078470523
 CCSD total energy / hartree       =      -305.445420807287917

GPU

 CCSD iterations
 -----------------------------------------------------------------
 Iter          Residuum       Correlation     Cpu    Wall    V2*C2
 -----------------------------------------------------------------
    1   0.2963584119498  -1.1503632882976   160.6   160.6    78.0
           ccsd_t2_1:8+c2f      0.098   13.566    6.736    0.753    9.482    0.933   38.425   77.992    0.469
    2   0.0616536970547  -1.1400056629788   159.2   159.2    77.3
           ccsd_t2_1:8+c2f      0.089   13.498    6.737    0.659    9.422    0.921   38.390   77.321    0.472
    3   0.0200706083152  -1.1564661687148   158.8   158.8    77.0
           ccsd_t2_1:8+c2f      0.088   13.416    6.707    0.639    9.468    0.923   38.358   77.023    0.466
    4   0.0077536091138  -1.1573100784706   158.8   158.8    76.8
           ccsd_t2_1:8+c2f      0.089   13.472    6.764    0.646    9.427    0.908   38.411   76.783    0.468
 -----------------------------------------------------------------
 Iterations converged
 CCSD correlation energy / hartree =        -1.157310078470578
 CCSD total energy / hartree       =      -305.445420807287974

For (H2O)6 with cc-pVTZ, the T2_7 speedup is closer to 2x (28 seconds on GPU vs 53 seconds on CPU).
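The speedups implied by these numbers can be checked with a few lines of arithmetic. The sketch below copies the iteration 1 per-term timings from the two tables above. It assumes that the nine columns on the `ccsd_t2_1:8+c2f` line are T2_1 through T2_8 followed by c2f, in that order; the label suggests this, but the log does not state it.

```python
# Per-term timings in seconds, copied from iteration 1 of each table.
# Assumption: columns are T2_1..T2_8 followed by c2f, in that order.
first = [0.095, 13.579, 6.707, 0.667, 9.497, 0.936, 51.310, 133.359, 0.472]
second = [0.098, 13.566, 6.736, 0.753, 9.482, 0.933, 38.425, 77.992, 0.469]
labels = [f"T2_{n}" for n in range(1, 9)] + ["c2f"]


def speedups(slow, fast):
    """Ratio slow/fast for each labelled term."""
    return {name: s / f for name, s, f in zip(labels, slow, fast)}


ratios = speedups(first, second)
wall_ratio = 228.5 / 160.6   # iteration 1 wall time, first table vs second
hexamer_ratio = 53 / 28      # the (H2O)6 cc-pVTZ T2_7 timings quoted above

for name, r in ratios.items():
    print(f"{name}: {r:.2f}x")
print(f"wall: {wall_ratio:.2f}x, (H2O)6 T2_7: {hexamer_ratio:.2f}x")
```

Under that column assumption, the seventh column drops from about 51 s to about 38 s per iteration, roughly 1.3x, which is consistent with the remark that the (H2O)6 case (53/28, about 1.9x) benefits more.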

sync on oldphase was wrong. sync on the phase associated with the buffers about to be overwritten. remove oldphase variable since no longer needed. fuse the two accumulates at the end since accumulate is more expensive than memcpy. Signed-off-by: Jeff Hammond <[email protected]>
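The synchronization rule in that commit message can be sketched generically. The Python below is only an illustration under stated assumptions: threads stand in for whatever asynchronous mechanism the PR actually uses, and `process_blocks`, `buffers`, and `pending` are hypothetical names. It shows the rule itself: before overwriting the staging buffer for a given phase, wait for the outstanding work that reads that same phase's buffer.

```python
from concurrent.futures import ThreadPoolExecutor


def process_blocks(blocks, nphase=2):
    """Sum each block asynchronously while reusing `nphase` staging buffers."""
    blocks = list(blocks)
    buffers = [[] for _ in range(nphase)]   # one staging buffer per phase
    pending = [None] * nphase               # in-flight work reading buffers[phase]
    results = []
    with ThreadPoolExecutor(max_workers=nphase) as pool:
        for n, block in enumerate(blocks):
            phase = n % nphase
            # Sync on the phase whose buffer is about to be overwritten,
            # not on the phase used by the previous iteration.
            if pending[phase] is not None:
                results.append(pending[phase].result())
            buffers[phase][:] = block        # in-place overwrite is now safe
            pending[phase] = pool.submit(sum, buffers[phase])
        # Drain the remaining in-flight work in submission order.
        for n in range(max(0, len(blocks) - nphase), len(blocks)):
            results.append(pending[n % nphase].result())
    return results


print(process_blocks([[1, 2], [3, 4], [5, 6], [7, 8], [9]]))
```

With two phases, block n reuses the buffer of block n-2, so that is the work that has to finish first; waiting on block n-1 instead would leave the buffer being overwritten unprotected.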

jeffhammond · 2024-10-23T17:13:21Z

This includes a fix to the T2_8 GPU code as well.

edoapra · 2024-10-25T15:41:52Z

@jeffhammond Could you please rebase and squash commit when you are ready to go?

jeffhammond · 2024-10-25T15:43:03Z

I tried to rebase earlier. It was impossible. I'll have to rip it apart and create the PR from scratch.

jeffhammond added 30 commits October 5, 2023 21:10

43ebd7f this works
2fa9876 add DGEMM version too
8062b6d straight DGEMM works
e80bac5 remove the loops - DGEMM will always be better
fc17d91 removing loops
b3ce4c1 cleanup
567fd44 do the pure DGEMM T2_8 in ICSD/NTS too
0a1b133 move makefile include to the top so we can use its vars
6de9fbe still debugging
e78fb8e so far, so good
9f12e90 so far, so good
409ba37 okay, it works correctly now
360e2e7 okay, it works correctly now
355a2c8 now time for double buffering
74ab406 clean up
34242be arrays are column major. wow.
9335011 n stream version using n=1
2b18f0c n stream version using n=1
12f47c9 2 phase version is correct
7cf47de comment syntax
aec8780 move T2_7 into separate file
d34e8a0 move the parent 7 term
62b5b91 make it pretty
9ebe5ff more formatting
40d25c8 more formatting
a88ecc5 more formatting; code copy for _x version
bbb9b8b alloc hoist
35f5619 alloc reduction
307156f alloc reduction
707c738 remove pointless DGEMM from ccsd_e

jeffhammond added 24 commits October 15, 2024 10:48

fc0a851 this works
c0b4a97 this works
b1b5eb5 this works
58752ad this works with stdpar=gpu
fe7c87a input
eed2c64 add CCSD term timers
9954c5d another T2_7 term runs on GPU
c05af60 test
29df905 fix misplaced preprocessing and add _y version
745c033 stdpar sort
4676886 GPU stdpar sort
ba2631d massive cleanup of GPU T2 7
534d78a cleanup T2 7 further
03b44bf performance
ff20a9d preprocessor guard on CUBLAS module
39070b5 reset generic input file
ba3242e Merge branch 'ccsd_t2_dgemm_cublas_7' of https://github.com/jeffhammo…
        …nd/nwchem into ccsd_t2_dgemm_cublas_7
2ddc447 oops
aee190d add 2D sorts for GPU
7c1459f GPU version of ccsd_t2_7_2_x
        this is an O(N^5) term, so it probably doesn't matter... Signed-off-by: Jeff Hammond <[email protected]>
c562fc8 remove commented-out code
        Signed-off-by: Jeff Hammond <[email protected]>
f3b5539 add alg that does O(N^5) on CPU but other on GPU, which is slower
2904ac6 reset

jeffhammond added 2 commits October 25, 2024 17:37

9d4040a add 2 phase ccsd_t2_7_3 and broken cublas of the same
450b52e almost there, i hope

jeffhammond closed this Oct 25, 2024
