
BaseFab::lockAdd: Faster version of BaseFab::atomicAdd for OpenMP #3696

Merged
merged 1 commit into AMReX-Codes:development on Jan 12, 2024

Conversation

@WeiqunZhang (Member) commented Jan 11, 2024

For WarpX's Gordon Bell runs on Fugaku, Stephan Jaure of ATOS optimized atomicAdd using pthread spin locks. This commit implements Stephan's approach using OpenMP.

In AMReX::Initialize, we create a number of OMP locks. When BaseFab::lockAdd is called, we loop over the planes in the z-direction and try to acquire a lock with omp_test_lock. If that succeeds, we can access the data in that z-plane without worrying about race conditions, which allows us to use SIMD instructions instead of omp atomic adds. If it fails, we try a different z-plane. The process continues until all planes have been processed.
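In code, the idea looks roughly like the following minimal sketch (illustrative only; names such as lock_add_planes, plane_size, and omp_locks are assumptions, not the actual AMReX implementation):

```cpp
#include <omp.h>
#include <cstddef>
#include <vector>

// Global pool of OMP locks; each lock must be set up with omp_init_lock
// once at startup (in AMReX this happens in AMReX::Initialize).
constexpr int nlocks = 128;
omp_lock_t omp_locks[nlocks];

void lock_add_planes (double* dst, double const* src,
                      int plane_size, int zlo, int nplanes)
{
    std::vector<char> done(nplanes, 0);
    int planes_left = nplanes;
    int m = 0;
    while (planes_left > 0) {
        if (!done[m]) {
            // Map the global z-index of this plane to a lock in the pool.
            int ilock = (m + zlo) % nlocks;
            if (omp_test_lock(&omp_locks[ilock])) {
                // We own this plane: plain SIMD adds, no omp atomics needed.
                double*       d = dst + std::size_t(m) * plane_size;
                double const* s = src + std::size_t(m) * plane_size;
                #pragma omp simd
                for (int i = 0; i < plane_size; ++i) { d[i] += s[i]; }
                omp_unset_lock(&omp_locks[ilock]);
                done[m] = 1;
                --planes_left;
            }
            // Otherwise the lock is busy: skip this plane and retry later.
        }
        m = (m + 1) % nplanes;
    }
}
```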

@WeiqunZhang marked this pull request as ready for review on January 12, 2024 18:17
@WeiqunZhang (Member Author)
I have run a test (https://github.com/WeiqunZhang/amrex-devtests/blob/main/fab_atomicAdd/main.cpp) on my desktop machine and on Frontier. The results look good. The new version is up to 10x faster, and its performance is similar to a non-atomic add (which of course does not produce correct results).

@WeiqunZhang merged commit 255d30f into AMReX-Codes:development on Jan 12, 2024
69 checks passed
@WeiqunZhang deleted the omp_lock branch on January 12, 2024 22:48
@ax3l mentioned this pull request on Feb 3, 2024
```cpp
while (planes_left > 0) {
    AMREX_ASSERT(mm < nplanes);
    auto const m = mm + dlo[AMREX_SPACEDIM-1];
    int ilock = m % OpenMP::nlocks;
```
Member:

Should there be a runtime assert in this function if the number of active OMP threads exceeds OpenMP::nlocks?

@WeiqunZhang (Member Author) replied Feb 3, 2024:
The threads are competing for locks. Even if OpenMP::nlocks is 1, it would still work with any number of threads.

```cpp
void Initialize ();
void Finalize ();

static constexpr int nlocks = 128;
```
Member:
A short inline comment on the relation between nlocks and the number of OMP threads (constraints, assumptions, or performance impact) would be good here, I think.
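For illustration, such a comment might read something like this (hypothetical wording, based on the explanation above that even nlocks == 1 remains correct):

```cpp
// Pool of OMP locks shared by BaseFab::lockAdd. There is no required
// relation to the number of OMP threads: threads compete for the locks,
// and even nlocks == 1 is correct (a thread that fails to get a lock
// simply tries another z-plane). A larger pool only reduces contention.
static constexpr int nlocks = 128;
```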

@ax3l (Member) commented Feb 3, 2024

X-ref a Windows symbol linker regression in the 24.02 release:
conda-forge/warpx-feedstock#80 (comment)

I think we need to either write

```cpp
// .H
extern AMREX_EXPORT std::array<omp_lock_t,nlocks> omp_locks;

// .cpp
AMREX_EXPORT std::array<omp_lock_t,nlocks> omp_locks;
```

or

```cpp
// .H
AMREX_EXPORT std::array<omp_lock_t,nlocks> omp_locks;

// .cpp
std::array<omp_lock_t,nlocks> omp_locks;
```

or

```cpp
// .H
// not sure that works...
extern "C" AMREX_EXPORT std::array<omp_lock_t,nlocks> omp_locks;

// .cpp
std::array<omp_lock_t,nlocks> omp_locks;
```

because this involves the STL in the type...

@ax3l (Member) commented Feb 3, 2024

To understand this better: why is this global variable declared extern and not static (or with no annotation)? Do we need to access it from across TUs?

@WeiqunZhang (Member Author)
If it's static, we will have a copy in every TU. That would be wasteful, and it would be tricky (though possible) to initialize all the copies properly.
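As a generic illustration (hypothetical file and function names, not the AMReX sources), the extern pattern gives one program-wide lock pool:

```cpp
// Hypothetical two-file sketch (not the AMReX sources).
#include <array>
#include <omp.h>

constexpr int nlocks = 128;

// locks.H -- declaration shared by all TUs; every TU refers to the same object:
extern std::array<omp_lock_t,nlocks> omp_locks;

// locks.cpp -- the single program-wide definition, initialized once:
std::array<omp_lock_t,nlocks> omp_locks;

void init_locks ()
{
    for (auto& l : omp_locks) { omp_init_lock(&l); }
}

// With `static` in the header instead, each TU would get its own private
// copy: a lock acquired in one TU would not block code in another, and
// every copy would need its own initialization.
```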

@ax3l (Member) commented Feb 3, 2024

Got it, thx!

I wonder if we might need to use a plain C array here instead of std::array?
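For reference, the plain-C-array variant being floated would presumably look like this (a hypothetical sketch; this option was not adopted):

```cpp
// Hypothetical plain-C-array variant: no STL template instantiation
// crosses the Windows DLL boundary.
// .H
extern AMREX_EXPORT omp_lock_t omp_locks[nlocks];

// .cpp
omp_lock_t omp_locks[nlocks];
```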

@WeiqunZhang (Member Author)
std::array is probably fine. We already have extern AMREX_EXPORT std::string exename; in Src/Base/AMReX.H, and that was not an issue. So we just need to add AMREX_EXPORT. I don't think we need it in the .cpp file.
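Concretely, that suggestion would presumably amount to the following (a sketch modeled on the exename precedent; the exact lines in the eventual patch may differ):

```cpp
// .H -- declaration, exported for Windows DLL builds:
extern AMREX_EXPORT std::array<omp_lock_t,nlocks> omp_locks;

// .cpp -- definition, no AMREX_EXPORT needed here per the comment above:
std::array<omp_lock_t,nlocks> omp_locks;
```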

@WeiqunZhang (Member Author)
We need to add a CI job that checks for the Windows link issue.

@ax3l (Member) commented Feb 3, 2024

> So we just need to add AMREX_EXPORT. I don't think we need it in the .cpp file.

I wonder if I'm missing it: where would you add AMREX_EXPORT in this patch? (Don't we already declare it in the .H file?)

@ax3l (Member) commented Feb 3, 2024

> We need to add a CI job that checks for the Windows link issue.

I am a bit surprised we did not catch it in WarpX. We have a CI entry "Clang C++17 w/ OMP w/o MPI" with ClangCl that builds shared libraries with OpenMP enabled, just as in conda-forge...

@WeiqunZhang (Member Author)
Oh, I missed it. I thought the issue was that I forgot to add AMREX_EXPORT.

@ax3l (Member) commented Feb 3, 2024

I wish, but this patch looks so clean, I cannot spot it yet :D

Could also be a missing include, but they look clean here as well.

@WeiqunZhang (Member Author)
Maybe amrex was not built with OpenMP support?

@ax3l (Member) commented Feb 3, 2024

Yes, or the compiler and OpenMP backend implementation were mismatched. I cannot see it yet; I checked the compiler options and they look identical:
conda-forge/warpx-feedstock#80 (comment)
