
Bug: false sharing in threadpool makes ggml_barrier() needlessly slow #9588

Closed
wtarreau opened this issue Sep 22, 2024 · 1 comment · Fixed by #9598
Labels: bug-unconfirmed, low severity (used to report low severity bugs in llama.cpp, e.g. cosmetic issues, non-critical UI glitches)

Comments

@wtarreau
Contributor

What happened?

I was surprised to see libgomp take most of the CPU in perf top on an 80-core ARM server with 6 DDR4 channels. Building with GGML_NO_OPENMP removes libgomp, but then the CPU time is spent in ggml_barrier() instead, under load. Looking closer, I found the cause: the barrier is implemented using two separate variables that unfortunately sit in the same cache line. All the threads spinning on the second variable prevent the remaining threads from quickly incrementing the arrival counter, so the cache line bounces back and forth between all cores. In addition, every thread performs an atomic write to the barrier_passed counter, while only one of them ever writes a non-zero value there, causing heavy serialization again.
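
For illustration, here is a minimal sketch of the idea with C11 atomics. This is not the attached patch; the struct, the field names and the 64-byte line size are assumptions made only for the example:

#include <stdatomic.h>

#define CACHE_LINE 64   /* assumed cache line size */

struct barrier_state {
    /* keep each counter on its own cache line so that threads spinning on
     * n_barrier_passed do not keep stealing the line that arriving threads
     * need to write when they increment n_barrier */
    _Alignas(CACHE_LINE) atomic_int n_barrier;
    _Alignas(CACHE_LINE) atomic_int n_barrier_passed;
};

static void barrier_wait(struct barrier_state *bs, int n_threads) {
    int passed_old = atomic_load_explicit(&bs->n_barrier_passed, memory_order_relaxed);

    if (atomic_fetch_add(&bs->n_barrier, 1) == n_threads - 1) {
        /* last thread to arrive: reset the arrival counter and release the
         * others; it is the only thread that writes n_barrier_passed */
        atomic_store_explicit(&bs->n_barrier, 0, memory_order_relaxed);
        atomic_fetch_add(&bs->n_barrier_passed, 1);
    } else {
        /* waiting threads only load in the spin loop, they never write */
        while (atomic_load(&bs->n_barrier_passed) == passed_old) {
            /* spin */
        }
    }
}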

Addressing only one half of the issue obviously doesn't fully unlock the performance, but fixing both at once gives significant gains (+21% vs. base, +3% vs. OpenMP) on text generation, tested like this:

$ ./llama-cli --temp 0.1 -m ../models/Mistral-Small-Instruct-2409-Q5_K_M.gguf -p "write about anything" -n 40

The results are:

  • base (openmp): 7.04 token/s
  • GGML_NO_OPENMP=1: 5.98 token/s
  • GGML_NO_OPENMP=1 + fix: 7.27 token/s

On smaller models it's even more visible:

$ ./llama-cli --temp 0.1 -m ../models/mistral-7b-instruct-v0.2.Q5_K_M.gguf -p "write about anything" -n 100

  • base (openmp): 20.4 token/s
  • GGML_NO_OPENMP=1: 16.01 token/s
  • GGML_NO_OPENMP=1 + fix: 20.87 token/s (+30.3% vs. the no-OpenMP build, +2.3% vs. OpenMP)

I'm not observing any relevant gain on x86, however the only x86 machines I have access to have few cores and are limited by DRAM bandwidth. It may well make a difference on EPYC parts with multiple CCDs.

I'm attaching the patch that addresses it here. I could send a PR if needed, but that takes much more time and I won't have time for it until next weekend; I know you don't mind applying patches once they're explained, Georgi.

0001-threadpool-avoid-false-sharing-between-n_barrier_pas.patch.txt

Name and Version

$ ./llama-cli --version
version: 3802 (a5b57b0)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for aarch64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

Before (with openmp):
llama_perf_sampler_print:    sampling time =       2.01 ms /    44 runs   (    0.05 ms per token, 21857.92 tokens per second)
llama_perf_context_print:        load time =    2201.99 ms
llama_perf_context_print: prompt eval time =     229.43 ms /     4 tokens (   57.36 ms per token,    17.43 tokens per second)
llama_perf_context_print:        eval time =    5537.76 ms /    39 runs   (  141.99 ms per token,     7.04 tokens per second)
llama_perf_context_print:       total time =    5771.76 ms /    43 tokens


Before: (without openmp):
llama_perf_sampler_print:    sampling time =       2.03 ms /    44 runs   (    0.05 ms per token, 21664.20 tokens per second)
llama_perf_context_print:        load time =    2253.97 ms
llama_perf_context_print: prompt eval time =     258.66 ms /     4 tokens (   64.66 ms per token,    15.46 tokens per second)
llama_perf_context_print:        eval time =    6520.80 ms /    39 runs   (  167.20 ms per token,     5.98 tokens per second)
llama_perf_context_print:       total time =    6787.95 ms /    43 tokens


After (without openmp):
llama_perf_sampler_print:    sampling time =       2.06 ms /    44 runs   (    0.05 ms per token, 21369.60 tokens per second)
llama_perf_context_print:        load time =    2189.62 ms
llama_perf_context_print: prompt eval time =     231.70 ms /     4 tokens (   57.92 ms per token,    17.26 tokens per second)
llama_perf_context_print:        eval time =    5362.74 ms /    39 runs   (  137.51 ms per token,     7.27 tokens per second)
llama_perf_context_print:       total time =    5602.92 ms /    43 tokens
@max-krasnyansky
Collaborator

@wtarreau

Aha, good to see other folks who care about barrier performance :). The timing couldn't be more perfect: we just merged a major update to the threading, and improving performance on machines with more than 16 cores was on the TODO list. I was never very happy about those extra writes from the waiting threads; they were added primarily to keep the thread sanitizer happy, and we did need a fence there anyway for correctness. You've shown a really good case for resurrecting some TSAN hacks so that we avoid making the normal case more expensive.
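
To make that trade-off concrete, a hedged sketch (not the merged code) of a write-free wait: the waiting threads spin with relaxed loads and rely on a single standalone fence for ordering, and standalone fences are exactly what ThreadSanitizer does not model well without extra annotations.

#include <stdatomic.h>

/* read-only spin-wait: no stores from the waiting threads, so the cache line
 * holding the counter stays shared instead of bouncing between cores */
static void spin_until_changed(const atomic_int *counter, int old_value) {
    while (atomic_load_explicit(counter, memory_order_relaxed) == old_value) {
        /* spin */
    }
    /* one acquire fence after the wait provides the ordering that the
     * per-iteration atomic writes used to give */
    atomic_thread_fence(memory_order_acquire);
}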

I started a new PR that includes your patch and adds support for Windows, among other things.
#9598
