BaseFab::lockAdd: Faster version of BaseFab::atomicAdd for OpenMP #3696
Conversation
For WarpX's Gordon Bell runs on Fugaku, Stephan Jaure of ATOS optimized atomicAdd using pthread spin locks. This commit implements Stephan's approach using OpenMP. In AMReX::Initialize, we create a number of OMP locks. When BaseFab::lockAdd is called, we loop over planes in the z-direction and try to acquire a lock with omp_test_lock. If that succeeds, we can access the data in that z-plane without worrying about race conditions, which allows us to use SIMD instructions instead of omp atomic adds. If it fails, we try a different z-plane. The process continues until all planes are processed.
I have used a test at https://github.com/WeiqunZhang/amrex-devtests/blob/main/fab_atomicAdd/main.cpp on my desktop machine and on Frontier. The results look good: the new version is up to 10x faster, and its performance is similar to a non-atomic add (which of course does not produce correct results).
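The plane-wise locking scheme described above can be sketched in portable C++ using std::mutex::try_lock as a stand-in for omp_test_lock. This is a single-threaded illustration of the control flow, not the actual AMReX implementation; the names lock_add and plane_locks are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <mutex>
#include <vector>

// A small pool of locks shared by all planes (AMReX uses OMP locks created
// in AMReX::Initialize; std::mutex here is only a stand-in).
constexpr int nlocks = 8;
std::array<std::mutex, nlocks> plane_locks;

// dst += src, where both are flat arrays of nplanes z-planes with plane_size
// elements each. A plane is added only while its lock is held; if the lock
// is busy, we move on and retry that plane later.
void lock_add (double* dst, double const* src, int nplanes, int plane_size)
{
    std::vector<char> done(nplanes, 0);
    int planes_left = nplanes;
    int m = 0;
    while (planes_left > 0) {
        if (!done[m]) {
            int ilock = m % nlocks;  // planes map onto the lock pool round-robin
            if (plane_locks[ilock].try_lock()) {
                double*       d = dst + std::size_t(m) * plane_size;
                double const* s = src + std::size_t(m) * plane_size;
                for (int i = 0; i < plane_size; ++i) {
                    d[i] += s[i];    // plain loop, so the compiler can vectorize
                }
                plane_locks[ilock].unlock();
                done[m] = 1;
                --planes_left;
            }
        }
        m = (m + 1) % nplanes;       // lock busy (or plane done): try the next plane
    }
}
```

The point of try_lock over a blocking lock is that a thread never waits: if a plane is being updated by someone else, it makes progress on a different plane instead.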
while (planes_left > 0) {
    AMREX_ASSERT(mm < nplanes);
    auto const m = mm + dlo[AMREX_SPACEDIM-1];
    int ilock = m % OpenMP::nlocks;
Should there be a runtime assert in this function if the number of active OMP threads exceeds OpenMP::nlocks?
The threads are competing for locks. Even if OpenMP::nlocks is 1, it would still work with any number of threads.
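The reason any nlocks >= 1 is correct: the lock index is a pure function of the plane index, so two threads updating the same plane always contend on the same lock; a small pool only increases contention, never allows a race. A tiny illustration (lock_index is a hypothetical helper, not AMReX API):

```cpp
// Plane m always maps to the same lock for a given pool size, so writers to
// the same plane serialize regardless of how many threads are running.
inline int lock_index (int m, int nlocks) { return m % nlocks; }
```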
void Initialize ();
void Finalize ();

static constexpr int nlocks = 128;
A short inline comment on the relation between nlocks and the number of OMP threads (constraints, assumptions, or performance impact) would be good here, I think.
X-ref a Windows symbol linker regression in the 24.02 release: I think we need to write either

// .H
extern AMREX_EXPORT std::array<omp_lock_t,nlocks> omp_locks;
// .cpp
AMREX_EXPORT std::array<omp_lock_t,nlocks> omp_locks;

or

// .H
AMREX_EXPORT std::array<omp_lock_t,nlocks> omp_locks;
// .cpp
std::array<omp_lock_t,nlocks> omp_locks;

or

// .H
// not sure that works...
extern "C" AMREX_EXPORT std::array<omp_lock_t,nlocks> omp_locks;
// .cpp
std::array<omp_lock_t,nlocks> omp_locks;

because this involves the STL in the type...
To understand this better: why is this global variable declared extern instead of static?
If it's static, we will have a copy in every TU. That would be wasteful, and it would be tricky (though possible) to initialize them all properly.
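The extern-vs-static distinction under discussion can be sketched in one place (AMREX_EXPORT omitted; shared_counter is a hypothetical name). In real code the declaration lives in a header included by many TUs and the definition in a single .cpp:

```cpp
// .H equivalent: a declaration only -- every TU that includes this refers
// to the single program-wide object. A namespace-scope `static` variable
// would instead give each TU its own private copy.
extern int shared_counter;

// .cpp equivalent: the one definition.
int shared_counter = 0;
```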
Got it, thx! I wonder if we might need to use a plain C array here instead of std::array.
We need to add a CI check for the Windows link issue.
I wonder if I missed it: where would you add ...
I am a bit surprised we did not catch it in WarpX. We have a CI entry "Clang C++17 w/ OMP w/o MPI" with ClangCl that builds shared and with OpenMP, just like in conda-forge...
Oh, I missed it. I thought the issue was that I forgot to add ...
I wish, but this patch looks so clean, I cannot spot it yet :D Could also be a missing include, but they look clean here as well.
Maybe amrex was not built with OpenMP support?
Yes, or the compiler and OpenMP backend implementation were mismatched. I cannot see that yet; I checked the compiler options and they look identical: