
Switching the threading component to C++20's atomic synchronization? #24

Open · mreineck opened this issue Feb 12, 2024 · 3 comments

@mreineck (Owner)
As part of an experiment I recently switched from ducc's latch class to C++20's std::latch, and to my surprise I noticed a significant speedup when benchmarking the overhead of submitting work items to the thread pool.

Looking at the source code, it seems that std::latch (at least the one from libstdc++) uses atomic synchronization (see, e.g., https://developers.redhat.com/articles/2022/12/06/implementing-c20-atomic-waiting-libstdc#). I wonder if it might be useful to switch the rest of ducc's multithreading component to this mechanism as well.
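For context, the mechanism described in that article (std::atomic's C++20 wait/notify operations) can be used to build a latch-like primitive directly. The sketch below is purely illustrative; it is neither ducc's nor libstdc++'s actual code, and the class name atomic_latch is made up:

```cpp
// Illustrative only: a count-down latch built directly on C++20's
// std::atomic wait/notify facilities. This is not ducc's or libstdc++'s
// actual implementation, just a minimal sketch of the mechanism.
#include <atomic>
#include <cstddef>

class atomic_latch {
  std::atomic<std::ptrdiff_t> counter;

public:
  explicit atomic_latch(std::ptrdiff_t expected) : counter(expected) {}

  void count_down() {
    // The last arrival wakes every thread blocked in wait().
    if (counter.fetch_sub(1, std::memory_order_acq_rel) == 1)
      counter.notify_all();
  }

  void wait() const {
    // std::atomic::wait blocks (e.g. via a futex on Linux) until the stored
    // value differs from the one passed in, so loop until it reaches zero.
    for (auto v = counter.load(std::memory_order_acquire); v != 0;
         v = counter.load(std::memory_order_acquire))
      counter.wait(v, std::memory_order_acquire);
  }
};
```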

What's your opinion, @peterbell10 ?

@peterbell10 (Contributor) commented Feb 15, 2024

That makes sense to me. std::latch is likely to have a higher power overhead while waiting (i.e., less time spent sleeping), but when the waiting thread also has work to do, you don't spend that much time waiting, so it should be worth it.

@mreineck (Owner, Author)

Thanks! You are certainly right: any approach with (partial) busy waiting will require more power and is likely only useful in scenarios where each thread is guaranteed a dedicated core on which to run (e.g. large non-interactive scientific calculations run on a batch system).
Since the whole thing would require switching ducc to C++20, it will probably not go into the main branch very soon, but it's nice to explore what can be achieved with it. My current version is on the crazy_threading branch, where I

  • reduced your very general-purpose thread pool implementation to something minimalistic that basically serves as a launcher for OpenMP-like parallel regions
  • used custom signalling flags and latches to minimize latency

Using this I managed to reduce the time spent executing an empty parallel region from roughly 30 microseconds to 1.6 microseconds on the 16 hardware threads of my laptop.
The whole thing was prompted by the sub-optimal performance of the ducc FFT for small 2D and 3D transforms shown at https://github.com/blackwer/fft_bench. The results look much better now (in my local tests), but I'm not sure whether this warrants such a fundamental change.
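For illustration, a minimal launcher along these lines could look roughly like the sketch below. This is not the actual crazy_threading code; the class region_launcher and all of its members are invented names. The idea is that persistent workers block on an atomic generation counter via C++20 wait/notify, are released all at once to run an OpenMP-like parallel region, and report completion through an atomic countdown that plays the role of a latch.

```cpp
// Purely illustrative sketch, not the actual crazy_threading code:
// a stripped-down launcher for OpenMP-like parallel regions built on
// C++20 atomic wait/notify.
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <stop_token>
#include <thread>
#include <utility>
#include <vector>

class region_launcher {
  std::vector<std::jthread> workers;
  std::function<void(std::size_t)> work;     // body of the current parallel region
  std::atomic<std::uint64_t> generation{0};  // bumped once per submitted region
  std::atomic<std::size_t> pending{0};       // completion countdown ("latch")

public:
  explicit region_launcher(std::size_t nthreads) {
    for (std::size_t i = 0; i < nthreads; ++i)
      workers.emplace_back([this, i](std::stop_token st) {
        std::uint64_t seen = 0;
        for (;;) {
          generation.wait(seen);             // sleep until a new region (or shutdown) is posted
          if (st.stop_requested()) return;
          seen = generation.load();
          work(i);                           // execute this thread's share of the region
          if (pending.fetch_sub(1) == 1)
            pending.notify_one();            // last thread wakes the submitter
        }
      });
  }

  void execute(std::function<void(std::size_t)> f) {
    work = std::move(f);
    pending.store(workers.size());
    generation.fetch_add(1);
    generation.notify_all();                 // release all workers simultaneously
    for (auto p = pending.load(); p != 0; p = pending.load())
      pending.wait(p);                       // low-latency wait for completion
  }

  ~region_launcher() {
    for (auto &w : workers) w.request_stop();
    generation.fetch_add(1);                 // wake workers so they observe the stop request
    generation.notify_all();
  }
};
```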

@mreineck (Owner, Author)

There is one problematic thing with the change: for some reason, the CI runs on macOS become extremely slow and are cancelled before the tests can finish. I watched the run, and I'm fairly certain that there are no deadlocks, but the whole test script still slows down to a crawl (see, e.g., https://github.com/mreineck/ducc/actions/runs/7931217750). I have no idea what's causing this or whether it's a property of the testing environment...
