
port trczdf to GPU #3

Open · wants to merge 2 commits into dev_gpu
Conversation

dindon-sournois (Collaborator)

No description provided.

Comment on lines +75 to +95
#ifdef _OPENACC
subroutine myalloc_ZDF_gpu()
allocate(zwd(jpk, dimen_jvzdf))
zwd = huge(zwd(1,1))
allocate(zws(jpk, dimen_jvzdf))
zws = huge(zws(1,1))
allocate(zwi(jpk, dimen_jvzdf))
zwi = huge(zwi(1,1))
allocate(zwx(jpk, dimen_jvzdf))
zwx = huge(zwx(1,1))
allocate(zwy(jpk, dimen_jvzdf))
zwy = huge(zwy(1,1))
allocate(zwz(jpk, dimen_jvzdf))
zwz = huge(zwz(1,1))
allocate(zwt(jpk, dimen_jvzdf))
zwt = huge(zwt(1,1))

!$acc enter data create(zwd,zwi,zwx,zws,zwz,zwy,zwt)
!$acc update device(zwd,zwi,zwx,zws,zwz,zwy,zwt)
END subroutine myalloc_ZDF_gpu
#endif
Collaborator Author

We create a new subroutine here that is called once in trczdf, after the value of dimen_jvzdf is known.

We could probably do the same for the CPU version to avoid duplicated code; the memory counter might also need to be adapted.
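A minimal sketch of what a shared CPU/GPU allocator could look like (hypothetical subroutine name; assumes the CPU path allocates the same zw* work arrays). The OpenACC directives compile away when _OPENACC is not defined, so one routine could serve both builds:

```fortran
subroutine myalloc_ZDF()
   ! Host allocation shared by CPU and GPU builds
   allocate(zwd(jpk, dimen_jvzdf), zws(jpk, dimen_jvzdf), &
            zwi(jpk, dimen_jvzdf), zwx(jpk, dimen_jvzdf), &
            zwy(jpk, dimen_jvzdf), zwz(jpk, dimen_jvzdf), &
            zwt(jpk, dimen_jvzdf))
   ! Poison values to catch reads of uninitialized elements
   zwd = huge(zwd); zws = huge(zws); zwi = huge(zwi)
   zwx = huge(zwx); zwy = huge(zwy); zwz = huge(zwz); zwt = huge(zwt)
#ifdef _OPENACC
   ! copyin = enter data create + update device in one directive
   !$acc enter data copyin(zwd, zws, zwi, zwx, zwy, zwz, zwt)
#endif
   ! NOTE: the memory counter would be updated here as well (not shown)
end subroutine myalloc_ZDF
```

This is only a sketch of the deduplication idea raised above, not the actual trczdf code.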

Comment on lines -177 to 179
!$acc enter data create( e1t(1:jpj,1:jpi), e2t(1:jpj,1:jpi), e3t(1:jpk,1:jpj,1:jpi) ) if(use_gpu)
!$acc enter data create( e1u(1:jpj,1:jpi), e2u(1:jpj,1:jpi), e3u(1:jpk,1:jpj,1:jpi) ) if(use_gpu)
!$acc enter data create( e1v(1:jpj,1:jpi), e2v(1:jpj,1:jpi), e3v(1:jpk,1:jpj,1:jpi) ) if(use_gpu)
!$acc enter data create( e3w(1:jpk,1:jpj,1:jpi) ) if(use_gpu)
!$acc enter data create( un(1:jpk,1:jpj,1:jpi), vn(1:jpk,1:jpj,1:jpi), wn(1:jpk,1:jpj,1:jpi) ) if(use_gpu)
Collaborator Author

It is not a good idea to declare these arrays here:

  • they are allocated and deallocated again later, which wastes time
  • the GPU allocation should be moved next to the corresponding CPU allocate as the port progresses
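A sketch of the pairing being suggested, assuming each array's device lifetime should mirror its host lifetime (hypothetical subroutine names, reduced to the e*t arrays for brevity):

```fortran
! Keep the device allocation next to the host allocation so the two
! lifetimes stay in sync and nothing is created twice.
subroutine alloc_grid_arrays()
   allocate(e1t(jpj, jpi), e2t(jpj, jpi), e3t(jpk, jpj, jpi))
   !$acc enter data create(e1t, e2t, e3t) if(use_gpu)
end subroutine alloc_grid_arrays

subroutine dealloc_grid_arrays()
   ! exit data before deallocate, in the reverse order of creation
   !$acc exit data delete(e1t, e2t, e3t) if(use_gpu)
   deallocate(e1t, e2t, e3t)
end subroutine dealloc_grid_arrays
```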

Comment on lines +136 to +137
! NOTE: kernel is too big, should be split
!$acc parallel loop gang vector default(present) async vector_length(32)
Collaborator Author

We might want to think about clever ways to generate this kernel, as it seems quite big. The best performance on an A100 was obtained with a vector length of 32, which is not very high.
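One possible direction is to split the jv loop into consecutive parallel regions, each touching fewer arrays, so that register pressure per kernel drops and a larger vector length may become profitable. A hedged sketch; the loop bodies are placeholders, not the actual trczdf computation:

```fortran
! First pass: build the tridiagonal coefficients only.
!$acc parallel loop gang vector default(present) async
do jv = 1, dimen_jvzdf
   ! ... fill zwd(:,jv), zws(:,jv), zwi(:,jv) ...
end do

! Second pass: the solver sweeps, reusing the coefficients from above.
!$acc parallel loop gang vector default(present) async
do jv = 1, dimen_jvzdf
   ! ... solve using zwd/zws/zwi, write zwx/zwy ...
end do
!$acc wait
```

Whether this wins depends on how much intermediate state the two passes would have to re-read from memory; profiling both variants would settle it.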

DO jv = 1, dimen_jvzdf

ji = jarr_zdf(2,jv)
jj = jarr_zdf(1,jv)
Aij = e1t(jj,ji) * e2t(jj,ji)

#ifdef _OPENACC
ntx=jv
Collaborator Author

For the GPU version we parallelize over dimen_jvzdf.

@dindon-sournois dindon-sournois marked this pull request as ready for review April 24, 2024 13:39