
Bug: false sharing in threadpool makes ggml_barrier() needlessly slow #9588

Closed
wtarreau opened this issue Sep 22, 2024 · 1 comment · Fixed by #9598
Labels: bug-unconfirmed, low severity (used to report low severity bugs in llama.cpp, e.g. cosmetic issues, non-critical UI glitches)

Comments

@wtarreau
Contributor

What happened?

I was surprised to see libgomp take most of the CPU in perf top on an 80-core ARM server with 6 DDR4 channels. Building with GGML_NO_OPENMP removes libgomp, but then the CPU time is spent in ggml_barrier() instead, under load. Looking closer, I found the cause: the barrier is implemented using two separate variables that unfortunately sit in the same cache line. All the threads spinning on the second variable prevent the remaining threads from quickly incrementing the arrival counter, so the cache line bounces back and forth between all cores. In addition, every thread performs an atomic write to the barrier_passed counter, while only one of them ever writes a non-zero value there, causing heavy serialization again.
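
For illustration, here is a minimal sketch of the idea with C11 atomics. This is not the attached patch; the struct, the field names and the 64-byte line size are assumptions made only for the example:

#include <stdatomic.h>

#define CACHE_LINE 64   /* assumed cache line size */

struct barrier_state {
    /* keep each counter on its own cache line so that threads spinning on
     * n_barrier_passed do not keep stealing the line that arriving threads
     * need to write when they increment n_barrier */
    _Alignas(CACHE_LINE) atomic_int n_barrier;
    _Alignas(CACHE_LINE) atomic_int n_barrier_passed;
};

static void barrier_wait(struct barrier_state *bs, int n_threads) {
    int passed_old = atomic_load_explicit(&bs->n_barrier_passed, memory_order_relaxed);

    if (atomic_fetch_add(&bs->n_barrier, 1) == n_threads - 1) {
        /* last thread to arrive: reset the arrival counter and release the
         * others; it is the only thread that writes n_barrier_passed */
        atomic_store_explicit(&bs->n_barrier, 0, memory_order_relaxed);
        atomic_fetch_add(&bs->n_barrier_passed, 1);
    } else {
        /* waiting threads only load in the spin loop, they never write */
        while (atomic_load(&bs->n_barrier_passed) == passed_old) {
            /* spin */
        }
    }
}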

Addressing only one half of the issue obviously doesn't fully unlock the performance, but fixing both at once gives significant gains (+21% vs. base, +3% vs. OpenMP) on text generation, tested like this:

$ ./llama-cli --temp 0.1 -m ../models/Mistral-Small-Instruct-2409-Q5_K_M.gguf -p "write about anything" -n 40

The results are:

  • base (openmp): 7.04 token/s
  • GGML_NO_OPENMP=1: 5.98 token/s
  • GGML_NO_OPENMP=1 + fix: 7.27 token/s

On smaller models it's even more visible:

$ ./llama-cli --temp 0.1 -m ../models/mistral-7b-instruct-v0.2.Q5_K_M.gguf -p "write about anything" -n 100

  • base (openmp): 20.4 token/s
  • GGML_NO_OPENMP=1: 16.01 token/s
  • GGML_NO_OPENMP=1 + fix: 20.87 token/s (+30.3% vs. the no-OpenMP build, +2.3% vs. OpenMP)

I'm not observing any relevant gain on x86, however the only x86 machines I have access to have few cores and are limited by DRAM bandwidth. It may well make a difference on EPYC parts with multiple CCDs.

I'm attaching the patch that addresses it here. I could send a PR if needed, but that takes much more time and I won't have time for it until next weekend; I know you don't mind applying patches once they're explained, Georgi.

0001-threadpool-avoid-false-sharing-between-n_barrier_pas.patch.txt

Name and Version

$ ./llama-cli --version
version: 3802 (a5b57b0)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for aarch64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

Before (with openmp):
llama_perf_sampler_print:    sampling time =       2.01 ms /    44 runs   (    0.05 ms per token, 21857.92 tokens per second)
llama_perf_context_print:        load time =    2201.99 ms
llama_perf_context_print: prompt eval time =     229.43 ms /     4 tokens (   57.36 ms per token,    17.43 tokens per second)
llama_perf_context_print:        eval time =    5537.76 ms /    39 runs   (  141.99 ms per token,     7.04 tokens per second)
llama_perf_context_print:       total time =    5771.76 ms /    43 tokens


Before: (without openmp):
llama_perf_sampler_print:    sampling time =       2.03 ms /    44 runs   (    0.05 ms per token, 21664.20 tokens per second)
llama_perf_context_print:        load time =    2253.97 ms
llama_perf_context_print: prompt eval time =     258.66 ms /     4 tokens (   64.66 ms per token,    15.46 tokens per second)
llama_perf_context_print:        eval time =    6520.80 ms /    39 runs   (  167.20 ms per token,     5.98 tokens per second)
llama_perf_context_print:       total time =    6787.95 ms /    43 tokens


After (without openmp):
llama_perf_sampler_print:    sampling time =       2.06 ms /    44 runs   (    0.05 ms per token, 21369.60 tokens per second)
llama_perf_context_print:        load time =    2189.62 ms
llama_perf_context_print: prompt eval time =     231.70 ms /     4 tokens (   57.92 ms per token,    17.26 tokens per second)
llama_perf_context_print:        eval time =    5362.74 ms /    39 runs   (  137.51 ms per token,     7.27 tokens per second)
llama_perf_context_print:       total time =    5602.92 ms /    43 tokens
@max-krasnyansky
Collaborator

@wtarreau

Aha, good to see other folks who care about barrier performance :). The timing couldn't be more perfect: we just merged a major update to the threading, and improving performance on machines with more than 16 cores was on the TODO list. I was never very happy about those extra writes from the waiting threads; they were added primarily to keep the thread sanitizer happy, and we did need a fence there anyway for correctness. You've shown a really good case for resurrecting some TSAN hacks so that we avoid making the normal case more expensive.
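
To make that trade-off concrete, a hedged sketch (not the merged code) of a write-free wait: the waiting threads spin with relaxed loads and rely on a single standalone fence for ordering, and standalone fences are exactly what ThreadSanitizer does not model well without extra annotations.

#include <stdatomic.h>

/* read-only spin-wait: no stores from the waiting threads, so the cache line
 * holding the counter stays shared instead of bouncing between cores */
static void spin_until_changed(const atomic_int *counter, int old_value) {
    while (atomic_load_explicit(counter, memory_order_relaxed) == old_value) {
        /* spin */
    }
    /* one acquire fence after the wait provides the ordering that the
     * per-iteration atomic writes used to give */
    atomic_thread_fence(memory_order_acquire);
}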

I started a new PR that includes your patch and adds support for Windows, among other things.
#9598
