
Global interpolation without GSLIB #1542

Open · wants to merge 155 commits into develop

Conversation

@MartinKarp (Collaborator) commented Oct 11, 2024

Global interpolation without gslib. Same functionality as before, plus GPU support, but no need to build with gslib any more. Removes gslib entirely from the codebase. I am now happy enough with it that I think we can start trying to get it into develop. It will continue to be optimized during the spring.

Features

  • Machinery for finding rst coordinates on the CPU/GPU without GSLIB (see the sketch after this list)
  • Options to execute interpolation on the host rather than device
  • Functions to copy data back and forth between host and device for vectors, fields and matrices
  • Unit tests for global interpolation
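
For readers unfamiliar with the term, "finding rst coordinates" means mapping a physical point back to the reference (r, s, t) coordinates of the element that contains it. The sketch below is a self-contained Newton iteration for a plain trilinear hexahedron; it only illustrates the general technique, and the program, its names and the trilinear mapping are assumptions for the demo, not Neko's actual implementation.

program rst_newton_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Reference-corner signs of the hexahedron, corners ordered lexicographically
  real(dp), parameter :: rc(8) = real([-1, 1, -1, 1, -1, 1, -1, 1], dp)
  real(dp), parameter :: sc(8) = real([-1, -1, 1, 1, -1, -1, 1, 1], dp)
  real(dp), parameter :: tc(8) = real([-1, -1, -1, -1, 1, 1, 1, 1], dp)
  real(dp) :: xv(3, 8), xp(3), rst(3)
  integer :: a

  ! Demo element: the unit cube [0,1]^3
  do a = 1, 8
     xv(:, a) = 0.5_dp * (1.0_dp + [rc(a), sc(a), tc(a)])
  end do

  xp = [0.25_dp, 0.75_dp, 0.5_dp]    ! physical point to locate
  rst = find_rst(xv, xp)
  print '(a, 3f10.6)', 'r, s, t = ', rst   ! expect -0.5, 0.5, 0.0

contains

  ! Newton iteration for x(r,s,t) = xp on a trilinear hex
  function find_rst(xv, xp) result(rst)
    real(dp), intent(in) :: xv(3, 8), xp(3)
    real(dp) :: rst(3), res(3), jac(3, 3), dn(3), n
    integer :: it, a

    rst = 0.0_dp                     ! start at the element centre
    do it = 1, 50
       res = -xp
       jac = 0.0_dp
       do a = 1, 8                   ! trilinear shape functions and derivatives
          n     = 0.125_dp * (1 + rc(a)*rst(1)) * (1 + sc(a)*rst(2)) * (1 + tc(a)*rst(3))
          dn(1) = 0.125_dp * rc(a) * (1 + sc(a)*rst(2)) * (1 + tc(a)*rst(3))
          dn(2) = 0.125_dp * (1 + rc(a)*rst(1)) * sc(a) * (1 + tc(a)*rst(3))
          dn(3) = 0.125_dp * (1 + rc(a)*rst(1)) * (1 + sc(a)*rst(2)) * tc(a)
          res = res + n * xv(:, a)
          jac(:, 1) = jac(:, 1) + dn(1) * xv(:, a)
          jac(:, 2) = jac(:, 2) + dn(2) * xv(:, a)
          jac(:, 3) = jac(:, 3) + dn(3) * xv(:, a)
       end do
       if (maxval(abs(res)) < 1.0e-12_dp) exit
       rst = rst - solve3(jac, res)  ! Newton update
    end do
  end function find_rst

  ! Solve a 3x3 linear system by Cramer's rule
  function solve3(a, b) result(x)
    real(dp), intent(in) :: a(3, 3), b(3)
    real(dp) :: x(3), d
    d = a(1,1)*(a(2,2)*a(3,3) - a(2,3)*a(3,2)) &
      - a(1,2)*(a(2,1)*a(3,3) - a(2,3)*a(3,1)) &
      + a(1,3)*(a(2,1)*a(3,2) - a(2,2)*a(3,1))
    x(1) = (b(1)*(a(2,2)*a(3,3) - a(2,3)*a(3,2)) &
          - a(1,2)*(b(2)*a(3,3) - a(2,3)*b(3)) &
          + a(1,3)*(b(2)*a(3,2) - a(2,2)*b(3))) / d
    x(2) = (a(1,1)*(b(2)*a(3,3) - a(2,3)*b(3)) &
          - b(1)*(a(2,1)*a(3,3) - a(2,3)*a(3,1)) &
          + a(1,3)*(a(2,1)*b(3) - b(2)*a(3,1))) / d
    x(3) = (a(1,1)*(a(2,2)*b(3) - b(2)*a(3,2)) &
          - a(1,2)*(a(2,1)*b(3) - b(2)*a(3,1)) &
          + b(1)*(a(2,1)*a(3,2) - a(2,2)*a(3,1))) / d
  end function solve3

end program rst_newton_sketch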

Todo before merge

  • Clean up the code a bit more and double-check the comments.
  • Understand the issue with the Mac double-precision test.

@MartinKarp changed the title from "Global interpolation no gslib v0.1" to "Global interpolation no gslib v0.1 -> v1.0" on Oct 11, 2024
@timofeymukha (Collaborator) commented:

@MartinKarp Related to the bug with object init: we should probably have an issue where we list the "offenders", so we can refactor them to use an init routine.

@MartinKarp marked this pull request as ready for review on October 21, 2024, 14:51
@MartinKarp (Collaborator, Author) commented Oct 21, 2024

I think this is more or less ready on the CPU. This is truly a "develop" PR in the sense that it is far from perfect, but it does remove the gslib dependency. I think we can wait with merging this until I add functionality like fast GPU support.

Comment on lines +260 to +261
!Isnt this a datarace?
!How do things like this ever get into the code...
Collaborator:

No, this is not a race, since there's no parallelism in the loop (do concurrent != omp parallel do).

Collaborator (Author):

Maybe I understood incorrectly then, but to me there is a data dependency on u(sp(i)), no? I thought do concurrent specified that there are no dependencies.

https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2023-1/do-concurrent.html

Collaborator:

Yes, that would be true, but we're working with the gathered data, so there are no duplicates. do concurrent is needed to force vectorisation.

Collaborator (Author):

But there might be several values to unpack from one rank to the same u(sp(i)), right?

Collaborator:

No, we're only communicating unique, gathered dofs, so from a single neighbour there will only be one place to update. The accumulation is needed to account for shared dofs received from other ranks, but for host-based MPI each neighbour has its own buffer, so there are no data dependencies in each unpack loop. (What can't be done is a multithreaded unpack without critical sections around the updates.)

The shorter array with the received data is later scattered, which has the one-to-many update pattern.
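
For readers following the thread, here is a minimal standalone sketch of the unpack pattern being discussed; the names recv_dof, recv_buf and u are made up for illustration and are not Neko's actual data structures. Because the dofs received from a single neighbour are unique, no two iterations touch the same element of u, so do concurrent's independence requirement holds; the accumulation only combines contributions from different neighbours across successive unpacks.

program unpack_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Local dofs touched by ONE neighbour's message: each index appears once
  integer, parameter :: recv_dof(4) = [2, 5, 7, 9]
  real(dp) :: recv_buf(4), u(10)
  integer :: i

  u = 1.0_dp
  recv_buf = [0.5_dp, 1.5_dp, 2.5_dp, 3.5_dp]

  ! No iteration defines or references an element of u that another
  ! iteration touches, so the do concurrent assertion is satisfied and
  ! the compiler is free to vectorise the loop.
  do concurrent (i = 1:size(recv_dof))
     u(recv_dof(i)) = u(recv_dof(i)) + recv_buf(i)
  end do

  print '(10f6.2)', u
end program unpack_sketch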

Comment on lines -148 to +150
do concurrent (j = 1:this%send_dof(dst)%size())
do j = 1,this%send_dof(dst)%size()
Collaborator:

Why would this need to be changed? (It also prevents optimisation of the loop filling the buffer.)

Comment on lines -154 to +157
associate(send_data => this%send_buf(i)%data)
call MPI_Isend(send_data, size(send_data), &
!associate(send_data => this%send_buf(i)%data)
call MPI_Isend(this%send_buf(i)%data, this%send_dof(dst)%size(), &
Collaborator:

Can we please keep this associate block so that the code still works with NAG Fortran?

@MartinKarp (Collaborator, Author) commented Feb 14, 2025:

Yeah, this was just me trying to find out why Isend was so slow for many messages on Dardel.
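
For context, here is a standalone sketch of the associate pattern under discussion, assuming the mpi_f08 bindings; the type buf_t and all variable names are made up for illustration and are not Neko's actual ones. Per the review comment above, keeping the associate block is what keeps the code working with NAG Fortran; the sketch simply shows the construct, using a self-send with the buffer accessed through the associate name.

program isend_associate_sketch
  use mpi_f08
  implicit none
  integer, parameter :: dp = kind(1.0d0)

  ! Illustrative stand-in for a send-buffer type with an allocatable payload
  type buf_t
     real(dp), allocatable :: data(:)
  end type buf_t

  type(buf_t) :: send_buf(1)
  type(MPI_Request) :: req
  type(MPI_Status) :: stat
  real(dp) :: recv(4)
  integer :: rank

  call MPI_Init()
  call MPI_Comm_rank(MPI_COMM_WORLD, rank)

  allocate(send_buf(1)%data(4))
  send_buf(1)%data = real(rank, dp)

  ! The associate block hands the non-blocking call a plain array name
  ! instead of a nested derived-type component reference.
  associate (send_data => send_buf(1)%data)
    call MPI_Isend(send_data, size(send_data), MPI_DOUBLE_PRECISION, &
                   rank, 0, MPI_COMM_WORLD, req)
  end associate

  ! Self-send for the demo: receive the message and complete the request
  call MPI_Recv(recv, size(recv), MPI_DOUBLE_PRECISION, rank, 0, &
                MPI_COMM_WORLD, stat)
  call MPI_Wait(req, MPI_STATUS_IGNORE)

  call MPI_Finalize()
end program isend_associate_sketch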

Labels: enhancement (New feature or request), GPU, refactor
Project status: 📋 Todo
5 participants