Bug: false sharing in threadpool makes ggml_barrier() needlessly slow #9588
Labels
bug-unconfirmed
low severity
What happened?
I was surprised to see libgomp take most of the CPU in perf top on an 80-core ARM server with 6 DDR4 channels. I found that I could disable libgomp by building with GGML_NO_OPENMP; the CPU time then moved to ggml_barrier(), on a load. Looking closer, I found the cause: the barrier is implemented using two separate variables that unfortunately sit in the same cache line. This means that all the threads spinning while they wait for the last one prevent the remaining threads from incrementing the thread count quickly, because the cache line bounces back and forth between all cores. In addition, all threads perform an atomic write to the barrier_passed counter, while only one of them writes a non-zero value there, which causes heavy serialization again.
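To illustrate the two changes, here is a minimal standalone sketch. It is not the attached patch and not the actual ggml code: the struct, the function names, the `n_barrier` field name and the 64-byte cache line size are all assumptions made for the example. Each counter gets its own cache line, and only the last thread to arrive writes to the "passed" counter while the others just read it.

```c
#include <pthread.h>
#include <sched.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE 64 /* assumed line size; some CPUs use 128 */

/* Hypothetical stand-in for the threadpool's barrier state. */
struct demo_barrier {
    alignas(CACHE_LINE) atomic_int n_barrier;        /* arrivals in the current round */
    alignas(CACHE_LINE) atomic_int n_barrier_passed; /* number of completed rounds */
    int n_threads;
};

static void demo_barrier_wait(struct demo_barrier *b) {
    /* Remember the round number before announcing our arrival. */
    int passed_old = atomic_load_explicit(&b->n_barrier_passed, memory_order_relaxed);

    if (atomic_fetch_add_explicit(&b->n_barrier, 1, memory_order_seq_cst) == b->n_threads - 1) {
        /* Last thread: reset the arrival count, then publish the new round.
         * This is the only write to n_barrier_passed in the whole round. */
        atomic_store_explicit(&b->n_barrier, 0, memory_order_relaxed);
        atomic_fetch_add_explicit(&b->n_barrier_passed, 1, memory_order_seq_cst);
        return;
    }

    /* Everyone else only reads, and on a line that arrivals do not touch. */
    while (atomic_load_explicit(&b->n_barrier_passed, memory_order_acquire) == passed_old) {
        sched_yield(); /* keeps the demo usable on small machines */
    }
}

struct demo_arg { struct demo_barrier *b; int n_rounds; };

static void *demo_worker(void *p) {
    struct demo_arg *a = p;
    for (int i = 0; i < a->n_rounds; i++) {
        demo_barrier_wait(a->b);
    }
    return NULL;
}

/* Runs n_threads workers through n_rounds barriers and returns the number
 * of completed rounds, or -1 on bad arguments or thread creation failure. */
static int demo_run(int n_threads, int n_rounds) {
    pthread_t tids[64];
    struct demo_barrier b;
    struct demo_arg arg = { &b, n_rounds };

    if (n_threads < 1 || n_threads > 64) {
        return -1;
    }
    atomic_init(&b.n_barrier, 0);
    atomic_init(&b.n_barrier_passed, 0);
    b.n_threads = n_threads;

    for (int i = 0; i < n_threads; i++) {
        if (pthread_create(&tids[i], NULL, demo_worker, &arg) != 0) {
            return -1;
        }
    }
    for (int i = 0; i < n_threads; i++) {
        pthread_join(tids[i], NULL);
    }
    return atomic_load(&b.n_barrier_passed);
}
```

Calling `demo_run(4, 1000)` pushes four threads through a thousand barriers. Build with `-pthread`. The `sched_yield()` in the spin loop is only there for the demo; a real implementation would more likely use a CPU relax hint.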
Addressing only one of the two issues does not fully unlock the performance, but fixing both together gives significant gains (+21% vs base, +3% vs OpenMP) on text generation when tested like this:
The results are:
On smaller models it's even more visible:
I'm not observing any relevant gain on x86, however. The only x86 machines I have access to have few cores, and their performance is limited by DRAM bandwidth. It might still make a difference on some EPYC parts that have multiple CCDs.
I'm attaching the patch that addresses it here. I could send a PR if needed, but that takes much more time and I won't have time for it until next weekend, and I know that you don't mind applying patches once they're explained, Georgi.
0001-threadpool-avoid-false-sharing-between-n_barrier_pas.patch.txt
Name and Version
$ ./llama-cli --version
version: 3802 (a5b57b0)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for aarch64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output