
port trczdf to GPU #3

Open · wants to merge 2 commits into dev_gpu
Conversation

dindon-sournois (Collaborator)

No description provided.

Comment on lines +75 to +95
#ifdef _OPENACC
subroutine myalloc_ZDF_gpu()
allocate(zwd(jpk, dimen_jvzdf))
zwd = huge(zwd(1,1))
allocate(zws(jpk, dimen_jvzdf))
zws = huge(zws(1,1))
allocate(zwi(jpk, dimen_jvzdf))
zwi = huge(zwi(1,1))
allocate(zwx(jpk, dimen_jvzdf))
zwx = huge(zwx(1,1))
allocate(zwy(jpk, dimen_jvzdf))
zwy = huge(zwy(1,1))
allocate(zwz(jpk, dimen_jvzdf))
zwz = huge(zwz(1,1))
allocate(zwt(jpk, dimen_jvzdf))
zwt = huge(zwt(1,1))

!$acc enter data create(zwd,zwi,zwx,zws,zwz,zwy,zwt)
!$acc update device(zwd,zwi,zwx,zws,zwz,zwy,zwt)
END subroutine myalloc_ZDF_gpu
#endif
Collaborator Author

We create a new subroutine here that is called once in trczdf, after the value of dimen_jvzdf is known.

We could probably do the same for the CPU version to avoid duplicated code; the memory counter might also need to be adapted.
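A minimal sketch of what a shared CPU/GPU allocator could look like (hypothetical subroutine name; assumes the CPU path allocates the same zw* work arrays). The OpenACC directives compile away when _OPENACC is not defined, so one routine could serve both builds:

```fortran
subroutine myalloc_ZDF()
   ! Host allocation shared by CPU and GPU builds
   allocate(zwd(jpk, dimen_jvzdf), zws(jpk, dimen_jvzdf), &
            zwi(jpk, dimen_jvzdf), zwx(jpk, dimen_jvzdf), &
            zwy(jpk, dimen_jvzdf), zwz(jpk, dimen_jvzdf), &
            zwt(jpk, dimen_jvzdf))
   ! Poison values to catch reads of uninitialized elements
   zwd = huge(zwd); zws = huge(zws); zwi = huge(zwi)
   zwx = huge(zwx); zwy = huge(zwy); zwz = huge(zwz); zwt = huge(zwt)
#ifdef _OPENACC
   ! copyin = enter data create + update device in one directive
   !$acc enter data copyin(zwd, zws, zwi, zwx, zwy, zwz, zwt)
#endif
   ! NOTE: the memory counter would be updated here as well (not shown)
end subroutine myalloc_ZDF
```

This is only a sketch of the deduplication idea raised above, not the actual trczdf code.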

Comment on lines -177 to 179
!$acc enter data create( e1t(1:jpj,1:jpi), e2t(1:jpj,1:jpi), e3t(1:jpk,1:jpj,1:jpi) ) if(use_gpu)
!$acc enter data create( e1u(1:jpj,1:jpi), e2u(1:jpj,1:jpi), e3u(1:jpk,1:jpj,1:jpi) ) if(use_gpu)
!$acc enter data create( e1v(1:jpj,1:jpi), e2v(1:jpj,1:jpi), e3v(1:jpk,1:jpj,1:jpi) ) if(use_gpu)
!$acc enter data create( e3w(1:jpk,1:jpj,1:jpi) ) if(use_gpu)
!$acc enter data create( un(1:jpk,1:jpj,1:jpi), vn(1:jpk,1:jpj,1:jpi), wn(1:jpk,1:jpj,1:jpi) ) if(use_gpu)
Collaborator Author

It is not a good idea to declare these arrays here:

  • they are allocated and deallocated again later, which wastes time
  • the GPU allocation should be moved next to the corresponding CPU allocate as the port progresses
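A sketch of the pairing being suggested, assuming each array's device lifetime should mirror its host lifetime (hypothetical subroutine names, reduced to the e*t arrays for brevity):

```fortran
! Keep the device allocation next to the host allocation so the two
! lifetimes stay in sync and nothing is created twice.
subroutine alloc_grid_arrays()
   allocate(e1t(jpj, jpi), e2t(jpj, jpi), e3t(jpk, jpj, jpi))
   !$acc enter data create(e1t, e2t, e3t) if(use_gpu)
end subroutine alloc_grid_arrays

subroutine dealloc_grid_arrays()
   ! exit data before deallocate, in the reverse order of creation
   !$acc exit data delete(e1t, e2t, e3t) if(use_gpu)
   deallocate(e1t, e2t, e3t)
end subroutine dealloc_grid_arrays
```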

Comment on lines +136 to +137
! NOTE: kernel is too big, should be split
!$acc parallel loop gang vector default(present) async vector_length(32)
Collaborator Author

We might want to think about clever ways to generate this kernel, as it seems quite big. The best performance on an A100 was obtained with a vector length of 32, which is not very high.
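One possible direction is to split the jv loop into consecutive parallel regions, each touching fewer arrays, so that register pressure per kernel drops and a larger vector length may become profitable. A hedged sketch; the loop bodies are placeholders, not the actual trczdf computation:

```fortran
! First pass: build the tridiagonal coefficients only.
!$acc parallel loop gang vector default(present) async
do jv = 1, dimen_jvzdf
   ! ... fill zwd(:,jv), zws(:,jv), zwi(:,jv) ...
end do

! Second pass: the solver sweeps, reusing the coefficients from above.
!$acc parallel loop gang vector default(present) async
do jv = 1, dimen_jvzdf
   ! ... solve using zwd/zws/zwi, write zwx/zwy ...
end do
!$acc wait
```

Whether this wins depends on how much intermediate state the two passes would have to re-read from memory; profiling both variants would settle it.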

DO jv = 1, dimen_jvzdf

ji = jarr_zdf(2,jv)
jj = jarr_zdf(1,jv)
Aij = e1t(jj,ji) * e2t(jj,ji)

#ifdef _OPENACC
ntx=jv
Collaborator Author

For the GPU version we parallelize over dimen_jvzdf.

@dindon-sournois dindon-sournois marked this pull request as ready for review April 24, 2024 13:39