CCSD_T2_7 GPU support #1029

Closed
wants to merge 113 commits into from


jeffhammond
Collaborator

@jeffhammond jeffhammond commented Oct 16, 2024

This runs the T2_7 terms on the GPU. TCE_SORT uses DO CONCURRENT and is strictly GPU-agnostic at the source level, but compiles to GPU code with the NVIDIA Fortran compiler. The DGEMM calls use the automatic cuBLAS offload and compile for CPU as well.

This would benefit from asynchrony, but I'll do that later. Similarly, this will benefit ICSD/NTS as well, but that's a future PR.

Performance, again on an AMD 7950X and RTX 4090, for (H2O)4 with cc-pVQZ, using 16 MPI processes (ARMCI-MPI):

CPU

CCSD iterations
 -----------------------------------------------------------------
 Iter          Residuum       Correlation     Cpu    Wall    V2*C2
 -----------------------------------------------------------------
    1   0.2963584119497  -1.1503632882976   228.5   228.5   133.4
           ccsd_t2_1:8+c2f      0.095   13.579    6.707    0.667    9.497    0.936   51.310  133.359    0.472
    2   0.0616536970547  -1.1400056629788   227.6   227.6   132.8
           ccsd_t2_1:8+c2f      0.088   13.458    6.744    0.655    9.407    0.913   51.252  132.828    0.456
    3   0.0200706083152  -1.1564661687147   227.6   227.6   132.8
           ccsd_t2_1:8+c2f      0.086   13.464    6.765    0.649    9.403    0.914   51.325  132.844    0.453
    4   0.0077536091138  -1.1573100784705   228.4   228.4   132.9
           ccsd_t2_1:8+c2f      0.089   13.588    6.744    0.656    9.801    0.905   51.511  132.940    0.460
 -----------------------------------------------------------------
 Iterations converged
 CCSD correlation energy / hartree =        -1.157310078470523
 CCSD total energy / hartree       =      -305.445420807287917

GPU

CCSD iterations
 -----------------------------------------------------------------
 Iter          Residuum       Correlation     Cpu    Wall    V2*C2
 -----------------------------------------------------------------
    1   0.2963584119498  -1.1503632882976   160.6   160.6    78.0
           ccsd_t2_1:8+c2f      0.098   13.566    6.736    0.753    9.482    0.933   38.425   77.992    0.469
    2   0.0616536970547  -1.1400056629788   159.2   159.2    77.3
           ccsd_t2_1:8+c2f      0.089   13.498    6.737    0.659    9.422    0.921   38.390   77.321    0.472
    3   0.0200706083152  -1.1564661687148   158.8   158.8    77.0
           ccsd_t2_1:8+c2f      0.088   13.416    6.707    0.639    9.468    0.923   38.358   77.023    0.466
    4   0.0077536091138  -1.1573100784706   158.8   158.8    76.8
           ccsd_t2_1:8+c2f      0.089   13.472    6.764    0.646    9.427    0.908   38.411   76.783    0.468
 -----------------------------------------------------------------
 Iterations converged
 CCSD correlation energy / hartree =        -1.157310078470578
 CCSD total energy / hartree       =      -305.445420807287974

For (H2O)6 with cc-pVTZ, the T2_7 speedup is closer to 2x (28 seconds vs 53 seconds).
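
For reference, a quick sketch of the arithmetic behind these numbers. It rests on two assumptions that the log does not state outright: that the nine values after `ccsd_t2_1:8+c2f` are the times for T2_1 through T2_8 followed by c2f, and that the second (faster) run is the GPU build. The names `parse_terms` and `speedup` are made up for this illustration.

```python
# First-iteration timing lines copied from the two runs above.
# ASSUMPTION: the first run is the CPU build, the second the GPU build.
CPU_LINE = "ccsd_t2_1:8+c2f      0.095   13.579    6.707    0.667    9.497    0.936   51.310  133.359    0.472"
GPU_LINE = "ccsd_t2_1:8+c2f      0.098   13.566    6.736    0.753    9.482    0.933   38.425   77.992    0.469"


def parse_terms(line):
    """Return the per-term times as floats, dropping the leading label."""
    return [float(field) for field in line.split()[1:]]


def speedup(cpu_seconds, gpu_seconds):
    """Ratio of CPU time to GPU time; values above 1 mean the GPU run is faster."""
    return cpu_seconds / gpu_seconds


cpu = parse_terms(CPU_LINE)
gpu = parse_terms(GPU_LINE)

# ASSUMPTION: entries 1..8 are T2_1..T2_8 and the ninth entry is c2f.
for term in (7, 8):
    ratio = speedup(cpu[term - 1], gpu[term - 1])
    print(f"(H2O)4/cc-pVQZ T2_{term}: {ratio:.2f}x")

# The (H2O)6/cc-pVTZ figures quoted above: 53 seconds vs 28 seconds.
print(f"(H2O)6/cc-pVTZ T2_7: {speedup(53, 28):.2f}x")
```

Under those assumptions, T2_7 improves by roughly 1.3x for the cc-pVQZ case, which is why the cc-pVTZ result reads as "closer to 2x".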

@jeffhammond
Collaborator Author

This includes a fix to the T2_8 GPU code as well.

@edoapra
Collaborator

edoapra commented Oct 25, 2024

@jeffhammond Could you please rebase and squash commit when you are ready to go?

@jeffhammond
Collaborator Author

I tried to rebase earlier. It was impossible. I'll have to rip it apart and create the PR from scratch.
