BaseFab::lockAdd: Faster version of BaseFab::atomicAdd for OpenMP #3696
Conversation
For WarpX's Gordon Bell runs on Fugaku, Stephan Jaure of ATOS optimized atomicAdd using pthread spin locks. This commit implements Stephan's approach using OpenMP. In AMReX::Initialize, we create a number of OMP locks. When BaseFab::lockAdd is called, we loop over planes in the z-direction and try to acquire a lock with omp_test_lock. If that succeeds, we can access the data in that z-plane without worrying about race conditions, which allows us to use SIMD instructions instead of omp atomic adds. If it fails, we try a different z-plane. The process continues until all planes are processed.
I have used a test at https://github.com/WeiqunZhang/amrex-devtests/blob/main/fab_atomicAdd/main.cpp on my desktop machine and on Frontier. The results look good: the new version is up to 10x faster, and its performance is similar to a non-atomic add (which of course does not produce correct results).
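The plane-wise locking scheme described above can be sketched in portable C++ using std::mutex::try_lock as a stand-in for omp_test_lock. This is a single-threaded illustration of the control flow, not the actual AMReX implementation; the names lock_add and plane_locks are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <mutex>
#include <vector>

// A small pool of locks shared by all planes (AMReX uses OMP locks created
// in AMReX::Initialize; std::mutex here is only a stand-in).
constexpr int nlocks = 8;
std::array<std::mutex, nlocks> plane_locks;

// dst += src, where both are flat arrays of nplanes z-planes with plane_size
// elements each. A plane is added only while its lock is held; if the lock
// is busy, we move on and retry that plane later.
void lock_add (double* dst, double const* src, int nplanes, int plane_size)
{
    std::vector<char> done(nplanes, 0);
    int planes_left = nplanes;
    int m = 0;
    while (planes_left > 0) {
        if (!done[m]) {
            int ilock = m % nlocks;  // planes map onto the lock pool round-robin
            if (plane_locks[ilock].try_lock()) {
                double*       d = dst + std::size_t(m) * plane_size;
                double const* s = src + std::size_t(m) * plane_size;
                for (int i = 0; i < plane_size; ++i) {
                    d[i] += s[i];    // plain loop, so the compiler can vectorize
                }
                plane_locks[ilock].unlock();
                done[m] = 1;
                --planes_left;
            }
        }
        m = (m + 1) % nplanes;       // lock busy (or plane done): try the next plane
    }
}
```

The point of try_lock over a blocking lock is that a thread never waits: if a plane is being updated by someone else, it makes progress on a different plane instead.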
while (planes_left > 0) {
    AMREX_ASSERT(mm < nplanes);
    auto const m = mm + dlo[AMREX_SPACEDIM-1];
    int ilock = m % OpenMP::nlocks;
Should there be a runtime assert in this function if the number of active OMP threads exceeds OpenMP::nlocks?
The threads are competing for locks. Even if OpenMP::nlocks is 1, it would still work with any number of threads.
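The reason any nlocks >= 1 is correct: the lock index is a pure function of the plane index, so two threads updating the same plane always contend on the same lock; a small pool only increases contention, never allows a race. A tiny illustration (lock_index is a hypothetical helper, not AMReX API):

```cpp
// Plane m always maps to the same lock for a given pool size, so writers to
// the same plane serialize regardless of how many threads are running.
inline int lock_index (int m, int nlocks) { return m % nlocks; }
```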
void Initialize ();
void Finalize ();

static constexpr int nlocks = 128;
A short inline comment on the relation between nlocks and the number of OMP threads (constraints, assumptions, or performance impact) would be good here, I think.
X-ref a Windows symbol linker regression in the 24.02 release: I think we need to write either

// .H
extern AMREX_EXPORT std::array<omp_lock_t,nlocks> omp_locks;
// .cpp
AMREX_EXPORT std::array<omp_lock_t,nlocks> omp_locks;

or

// .H
AMREX_EXPORT std::array<omp_lock_t,nlocks> omp_locks;
// .cpp
std::array<omp_lock_t,nlocks> omp_locks;

or

// .H
// not sure that works...
extern "C" AMREX_EXPORT std::array<omp_lock_t,nlocks> omp_locks;
// .cpp
std::array<omp_lock_t,nlocks> omp_locks;

because this involves the STL in the type...
To understand this better: why is this global variable declared extern instead of static?
If it's static, we will have a copy in every TU. That would be wasteful, and it would be tricky (though possible) to initialize them all properly.
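The extern-vs-static distinction under discussion can be sketched in one place (AMREX_EXPORT omitted; shared_counter is a hypothetical name). In real code the declaration lives in a header included by many TUs and the definition in a single .cpp:

```cpp
// .H equivalent: a declaration only -- every TU that includes this refers
// to the single program-wide object. A namespace-scope `static` variable
// would instead give each TU its own private copy.
extern int shared_counter;

// .cpp equivalent: the one definition.
int shared_counter = 0;
```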
Got it, thx! I wonder if we might need to use a plain C array here instead of std::array.
We need to add a CI check for the Windows link issue.
I wonder if I missed it: where would you add ...
I am a bit surprised we did not catch it in WarpX. We have a CI entry "Clang C++17 w/ OMP w/o MPI" with ClangCl that builds shared and with OpenMP, just like in conda-forge...
Oh, I missed it. I thought the issue was that I forgot to add ...
I wish, but this patch looks so clean, I cannot spot it yet :D Could also be a missing include, but they look clean here as well.
Maybe amrex was not built with OpenMP support?
Yes, or the compiler and OpenMP backend implementation were mismatched. I cannot see that yet; I checked the compiler options and they look identical: