
Switching the threading component to C++20's atomic synchronization? #24

Open · mreineck opened this issue Feb 12, 2024 · 3 comments

@mreineck (Owner)
As part of an experiment I recently switched from ducc's latch class to C++20's std::latch, and to my surprise I noticed a significant speedup when benchmarking the overhead of submitting work items to the thread pool.

Looking at the source code, it seems that std::latch (at least the one from libstdc++) uses atomic synchronization (see, e.g., https://developers.redhat.com/articles/2022/12/06/implementing-c20-atomic-waiting-libstdc#). I wonder if it might be useful to switch the rest of ducc's multithreading component to this mechanism as well.
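For context, the mechanism described in that article (std::atomic's C++20 wait/notify operations) can be used to build a latch-like primitive directly. The sketch below is purely illustrative; it is neither ducc's nor libstdc++'s actual code, and the class name atomic_latch is made up:

```cpp
// Illustrative only: a count-down latch built directly on C++20's
// std::atomic wait/notify facilities. This is not ducc's or libstdc++'s
// actual implementation, just a minimal sketch of the mechanism.
#include <atomic>
#include <cstddef>

class atomic_latch {
  std::atomic<std::ptrdiff_t> counter;

public:
  explicit atomic_latch(std::ptrdiff_t expected) : counter(expected) {}

  void count_down() {
    // The last arrival wakes every thread blocked in wait().
    if (counter.fetch_sub(1, std::memory_order_acq_rel) == 1)
      counter.notify_all();
  }

  void wait() const {
    // std::atomic::wait blocks (e.g. via a futex on Linux) until the stored
    // value differs from the one passed in, so loop until it reaches zero.
    for (auto v = counter.load(std::memory_order_acquire); v != 0;
         v = counter.load(std::memory_order_acquire))
      counter.wait(v, std::memory_order_acquire);
  }
};
```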

What's your opinion, @peterbell10 ?

@peterbell10 (Contributor) commented Feb 15, 2024

That makes sense to me. std::latch is likely to have a higher power overhead while waiting (i.e., less time spent sleeping), but when the waiting thread also has work to do, you don't spend that much time waiting, so it should be worth it.

@mreineck (Owner, Author)

Thanks! You are certainly right: any approach with (partial) busy waiting will require more power and is likely only useful in scenarios where each thread is guaranteed a dedicated core on which to run (e.g. large non-interactive scientific calculations run on a batch system).
Since the whole thing would require switching ducc to C++20, it will probably not go into the main branch very soon, but it's nice to explore what can be achieved with it. My current version is on the crazy_threading branch, where I

  • reduced your very general-purpose thread pool implementation to something minimalistic that basically serves as a launcher for OpenMP-like parallel regions
  • used custom signalling flags and latches to minimize latency

Using this I managed to reduce the time spent executing an empty parallel region from roughly 30 microseconds to 1.6 microseconds on the 16 hardware threads of my laptop.
The whole thing was prompted by the sub-optimal performance of the ducc FFT for small 2D and 3D transforms shown at https://github.com/blackwer/fft_bench. The results look much better now (in my local tests), but I'm not sure whether this warrants such a fundamental change.
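For illustration, a minimal launcher along these lines could look roughly like the sketch below. This is not the actual crazy_threading code; the class region_launcher and all of its members are invented names. The idea is that persistent workers block on an atomic generation counter via C++20 wait/notify, are released all at once to run an OpenMP-like parallel region, and report completion through an atomic countdown that plays the role of a latch.

```cpp
// Purely illustrative sketch, not the actual crazy_threading code:
// a stripped-down launcher for OpenMP-like parallel regions built on
// C++20 atomic wait/notify.
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <stop_token>
#include <thread>
#include <utility>
#include <vector>

class region_launcher {
  std::vector<std::jthread> workers;
  std::function<void(std::size_t)> work;     // body of the current parallel region
  std::atomic<std::uint64_t> generation{0};  // bumped once per submitted region
  std::atomic<std::size_t> pending{0};       // completion countdown ("latch")

public:
  explicit region_launcher(std::size_t nthreads) {
    for (std::size_t i = 0; i < nthreads; ++i)
      workers.emplace_back([this, i](std::stop_token st) {
        std::uint64_t seen = 0;
        for (;;) {
          generation.wait(seen);             // sleep until a new region (or shutdown) is posted
          if (st.stop_requested()) return;
          seen = generation.load();
          work(i);                           // execute this thread's share of the region
          if (pending.fetch_sub(1) == 1)
            pending.notify_one();            // last thread wakes the submitter
        }
      });
  }

  void execute(std::function<void(std::size_t)> f) {
    work = std::move(f);
    pending.store(workers.size());
    generation.fetch_add(1);
    generation.notify_all();                 // release all workers simultaneously
    for (auto p = pending.load(); p != 0; p = pending.load())
      pending.wait(p);                       // low-latency wait for completion
  }

  ~region_launcher() {
    for (auto &w : workers) w.request_stop();
    generation.fetch_add(1);                 // wake workers so they observe the stop request
    generation.notify_all();
  }
};
```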

@mreineck (Owner, Author)

There is one problematic thing with the change: for some reason, the CI runs on macOS become extremely slow and are cancelled before the tests can finish. I watched the run, and I'm fairly certain that there are no deadlocks, but the whole test script still slows down to a crawl (see, e.g., https://github.com/mreineck/ducc/actions/runs/7931217750). I have no idea what's causing this or whether it's a property of the testing environment...
