Threadpool: take 2 #8672

Merged
merged 48 commits into from
Aug 29, 2024

Conversation

fmz
Contributor

@fmz fmz commented Jul 24, 2024

ref: original PR #7526

Added an API to support explicit management and fine-grained control of threadpools.
The API supports creating different threadpools for various parts of execution, e.g. batch, single-token, etc. Each threadpool can be created, paused, resumed, and released independently of any other threadpool. This mitigates the overhead of starting/stopping threads for each decode call and helps OSes keep track of scheduling history in order to make better scheduling decisions.

Each threadpool supports:

- Setting the number of threads (duh)
- Setting a CPU mask for threads to be placed on
- Strict/relaxed placement: pinning specific threads to specific cores, or letting the OS decide
- Polling or interrupt-driven wait
- Setting thread priority

Using threadpools explicitly is optional. If llama_decode is called with a llama_context that doesn't have a threadpool attached, a disposable threadpool is created (same as the current behavior).
If users choose to use threadpools explicitly, they have to manage them manually. See the example in main.cpp.
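
To illustrate the create / pause / resume / release lifecycle described above, here is a minimal sketch of a persistent pool built on pthreads. It is not the API added by this PR: every name (`demo_pool`, `demo_pool_create`, `demo_pool_run`, and so on) is hypothetical, error handling is omitted, and CPU masks, priorities, and polling are left out. Resuming the pool automatically when a job is submitted is a choice made for this sketch only.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

// Hypothetical names throughout; this is not the ggml threadpool API.
typedef void (*demo_work_fn)(int ith, int nth, void *arg);

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    pthread_t      *threads;
    int             n_threads;
    int             next_ith;   // hands each worker its index at startup
    int             generation; // bumped once per submitted job
    int             n_done;     // workers that finished the current job
    bool            paused, stop;
    demo_work_fn    fn;
    void           *arg;
} demo_pool;

static void *demo_worker_main(void *p) {
    demo_pool *pool = p;
    int seen = 0; // last job generation this worker ran
    pthread_mutex_lock(&pool->mutex);
    int ith = pool->next_ith++;
    for (;;) {
        // Sleep while paused or while no new job has been submitted.
        while (!pool->stop && (pool->paused || pool->generation == seen)) {
            pthread_cond_wait(&pool->cond, &pool->mutex);
        }
        if (pool->stop) break;
        seen = pool->generation;
        demo_work_fn fn = pool->fn;
        void *arg = pool->arg;
        pthread_mutex_unlock(&pool->mutex);
        fn(ith, pool->n_threads, arg); // run the job outside the lock
        pthread_mutex_lock(&pool->mutex);
        pool->n_done++;
        pthread_cond_broadcast(&pool->cond); // wake the submitter
    }
    pthread_mutex_unlock(&pool->mutex);
    return NULL;
}

// Allocation and pthread errors are not checked in this sketch.
demo_pool *demo_pool_create(int n_threads, bool start_paused) {
    demo_pool *pool = calloc(1, sizeof(*pool));
    pool->threads = calloc((size_t)n_threads, sizeof(*pool->threads));
    pool->n_threads = n_threads;
    pool->paused = start_paused;
    pthread_mutex_init(&pool->mutex, NULL);
    pthread_cond_init(&pool->cond, NULL);
    for (int i = 0; i < n_threads; i++)
        pthread_create(&pool->threads[i], NULL, demo_worker_main, pool);
    return pool;
}

static void demo_pool_set_paused(demo_pool *pool, bool paused) {
    pthread_mutex_lock(&pool->mutex);
    pool->paused = paused;
    pthread_cond_broadcast(&pool->cond);
    pthread_mutex_unlock(&pool->mutex);
}
void demo_pool_pause(demo_pool *pool)  { demo_pool_set_paused(pool, true); }
void demo_pool_resume(demo_pool *pool) { demo_pool_set_paused(pool, false); }

// Submits one job, resumes the pool if needed, and blocks until all workers finish.
void demo_pool_run(demo_pool *pool, demo_work_fn fn, void *arg) {
    pthread_mutex_lock(&pool->mutex);
    pool->fn = fn;
    pool->arg = arg;
    pool->n_done = 0;
    pool->paused = false;
    pool->generation++;
    pthread_cond_broadcast(&pool->cond);
    while (pool->n_done < pool->n_threads)
        pthread_cond_wait(&pool->cond, &pool->mutex);
    pthread_mutex_unlock(&pool->mutex);
}

void demo_pool_free(demo_pool *pool) {
    pthread_mutex_lock(&pool->mutex);
    pool->stop = true;
    pthread_cond_broadcast(&pool->cond);
    pthread_mutex_unlock(&pool->mutex);
    for (int i = 0; i < pool->n_threads; i++)
        pthread_join(pool->threads[i], NULL);
    pthread_mutex_destroy(&pool->mutex);
    pthread_cond_destroy(&pool->cond);
    free(pool->threads);
    free(pool);
}

// Example job: each worker adds the thread count to its own slot.
void demo_mark(int ith, int nth, void *arg) { ((int *)arg)[ith] += nth; }
```

The same worker threads serve every `demo_pool_run` call, which is the point of keeping a pool alive across decode calls instead of spawning threads each time.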

With all the bells and whistles enabled, we generally see a minor improvement vs OMP. Without polling, threadpool runs on par with OMP.
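
For readers unfamiliar with the polling vs interrupt-driven distinction, the following is a rough sketch of a hybrid wait: spin on an atomic flag for a bounded number of rounds, then fall back to a condition variable. The type `demo_event`, its functions, and the `poll_rounds` parameter are assumptions made for illustration and are not taken from this PR.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

// Hypothetical one-shot event; not taken from ggml.
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    atomic_bool     ready;
} demo_event;

void demo_event_init(demo_event *e) {
    pthread_mutex_init(&e->mutex, NULL);
    pthread_cond_init(&e->cond, NULL);
    atomic_init(&e->ready, false);
}

// Returns true if the event was seen while polling, false if the caller
// had to fall back to the condition variable.
bool demo_event_wait(demo_event *e, int poll_rounds) {
    for (int i = 0; i < poll_rounds; i++) {
        if (atomic_load_explicit(&e->ready, memory_order_acquire)) {
            return true;
        }
    }
    pthread_mutex_lock(&e->mutex);
    while (!atomic_load(&e->ready)) {
        pthread_cond_wait(&e->cond, &e->mutex);
    }
    pthread_mutex_unlock(&e->mutex);
    return false;
}

void demo_event_set(demo_event *e) {
    // Store under the mutex so a waiter cannot miss the wakeup between
    // checking the flag and going to sleep.
    pthread_mutex_lock(&e->mutex);
    atomic_store(&e->ready, true);
    pthread_cond_broadcast(&e->cond);
    pthread_mutex_unlock(&e->mutex);
}
```

Polling trades CPU time for wake-up latency: a waiter that catches the flag while spinning skips the trip through the kernel, while `poll_rounds == 0` gives a purely interrupt-driven wait.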

@github-actions github-actions bot added the testing, examples, server, and ggml labels Jul 24, 2024
@fmz
Contributor Author

fmz commented Jul 24, 2024

Here are some perf figures:

On W-2225 Xeon machine: CPU backend:

| CPU | Model | Test | t/s master | t/s threadpool | Speedup |
| --- | --- | --- | ---: | ---: | ---: |
| Intel(R) Xeon(R) W-2225 CPU @ 4.10GHz | llama 7B Q4_0 | pp512 | 17.46 | 17.51 | 1.00 |
| Intel(R) Xeon(R) W-2225 CPU @ 4.10GHz | llama 7B Q4_0 | tg128 | 6.98 | 7.06 | 1.01 |

Intel 10th-gen CPU:
./scripts/compare-commits.sh master threadpool -t 1,2,4,6,8,10

| CPU | Model | Threads | Test | t/s master | t/s threadpool-attempt-2 | Speedup |
| --- | --- | ---: | --- | ---: | ---: | ---: |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 1 | pp512 | 3.93 | 3.94 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 1 | tg128 | 2.43 | 2.44 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 2 | pp512 | 7.13 | 7.06 | 0.99 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 2 | tg128 | 4.37 | 4.36 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 4 | pp512 | 11.96 | 11.99 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 4 | tg128 | 6.79 | 6.77 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 6 | pp512 | 14.96 | 14.98 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 6 | tg128 | 7.51 | 7.53 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 8 | pp512 | 13.06 | 13.09 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 8 | tg128 | 6.88 | 6.83 | 0.99 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 10 | pp512 | 14.08 | 14.06 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 10 | tg128 | 7.49 | 7.52 | 1.00 |

Mobile NVIDIA 3060:
$ LLAMA_CUDA=1 ./scripts/compare-commits.sh master threadpool -nkvo 0,1

| GPU | Model | NKVO | Test | t/s master | t/s threadpool-attempt-2 | Speedup |
| --- | --- | --- | --- | ---: | ---: | ---: |
| RTX 3060 Laptop GPU | llama 7B Q4_0 | No | pp512 | 1644.73 | 1642.34 | 1.00 |
| RTX 3060 Laptop GPU | llama 7B Q4_0 | No | tg128 | 65.94 | 65.89 | 1.00 |
| RTX 3060 Laptop GPU | llama 7B Q4_0 | Yes | pp512 | 287.28 | 286.44 | 1.00 |
| RTX 3060 Laptop GPU | llama 7B Q4_0 | Yes | tg128 | 54.56 | 54.32 | 1.00 |

@fmz fmz force-pushed the threadpool branch 3 times, most recently from 043b9df to ef1ff14, July 25, 2024 14:50
@fmz
Contributor Author

fmz commented Jul 26, 2024

@slaren Threadpool is back!
Updated it a bit to be aligned with the latest graph-compute design. The current performance is largely on par with OpenMP.
Please lmk if you have any comments/suggestions.

@slaren
Collaborator

slaren commented Jul 26, 2024

I tried to test this on macOS, but it seems to deadlock.

WARNING: ThreadSanitizer: data race (pid=62377)
  Write of size 1 at 0x00010ab02a8e by main thread:
    #0 ggml_graph_compute ggml.c:19365 (llama-bench:arm64+0x10003fb54)
    #1 ggml_backend_cpu_graph_compute ggml-backend.c:822 (llama-bench:arm64+0x1000a5f1c)
    #2 ggml_backend_graph_compute_async ggml-backend.c:282 (llama-bench:arm64+0x10009bac0)
    #3 ggml_backend_sched_compute_splits ggml-backend.c:1795 (llama-bench:arm64+0x1000a3190)
    #4 ggml_backend_sched_graph_compute_async ggml-backend.c:1979 (llama-bench:arm64+0x1000a2d24)
    #5 llama_graph_compute(llama_context&, ggml_cgraph*, int, ggml_compute_threadpool*) llama.cpp:14412 (llama-bench:arm64+0x100292cac)
    #6 llama_decode_internal(llama_context&, llama_batch) llama.cpp:14666 (llama-bench:arm64+0x1000fda4c)
    #7 llama_decode llama.cpp:18489 (llama-bench:arm64+0x1000fc460)
    #8 test_prompt(llama_context*, int, int, int, int) llama-bench.cpp:1319 (llama-bench:arm64+0x10062bd5c)
    #9 main llama-bench.cpp:1454 (llama-bench:arm64+0x100627180)

  Previous read of size 1 at 0x00010ab02a8e by thread T12 (mutexes: write M0):
    #0 ggml_graph_compute_check_for_work ggml.c:19152 (llama-bench:arm64+0x100053a10)
    #1 ggml_graph_compute_secondary_thread ggml.c:19189 (llama-bench:arm64+0x1000537dc)

  Location is heap block of size 192 at 0x00010ab02a00 allocated by main thread:
    #0 posix_memalign <null>:96112260 (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x564c0)
    #1 ggml_aligned_malloc ggml.c:241 (llama-bench:arm64+0x10001ac88)
    #2 ggml_create_threadpool_impl ggml.c:19214 (llama-bench:arm64+0x10003f14c)
    #3 ggml_create_threadpool ggml.c:19292 (llama-bench:arm64+0x10003f0b8)
    #4 main llama-bench.cpp:1418 (llama-bench:arm64+0x100626cd0)

  Mutex M0 (0x00010ab02a00) created at:
    #0 pthread_mutex_init <null>:96112260 (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x31470)
    #1 ggml_create_threadpool_impl ggml.c:19238 (llama-bench:arm64+0x10003f404)
    #2 ggml_create_threadpool ggml.c:19292 (llama-bench:arm64+0x10003f0b8)
    #3 main llama-bench.cpp:1418 (llama-bench:arm64+0x100626cd0)

  Thread T12 (tid=36579987, running) created by main thread at:
    #0 pthread_create <null>:96112260 (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x3062c)
    #1 ggml_create_threadpool_impl ggml.c:19277 (llama-bench:arm64+0x10003f638)
    #2 ggml_create_threadpool ggml.c:19292 (llama-bench:arm64+0x10003f0b8)
    #3 main llama-bench.cpp:1418 (llama-bench:arm64+0x100626cd0)

SUMMARY: ThreadSanitizer: data race ggml.c:19365 in ggml_graph_compute
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x0000000104baf664 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3018:29
    frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x000000012481fc00) at ggml.c:19132:5
    frame #2: 0x0000000104ba17ec llama-bench`ggml_graph_compute(cgraph=0x00000001182901b8, cplan=0x000000016b28a730) at ggml.c:19373:5
    frame #3: 0x0000000104be8400 llama-bench`ggml_backend_cpu_graph_compute(backend=0x00006000039680b0, cgraph=0x00000001182901b8) at ggml-backend.c:822:12
    frame #4: 0x0000000104be23c4 llama-bench`ggml_backend_graph_compute_async(backend=0x00006000039680b0, cgraph=0x00000001182901b8) at ggml-backend.c:282:12
    frame #5: 0x0000000104be6834 llama-bench`ggml_backend_sched_compute_splits(sched=0x0000000115000000) at ggml-backend.c:1795:35
    frame #6: 0x0000000104be65a0 llama-bench`ggml_backend_sched_graph_compute_async(sched=0x0000000115000000, graph=0x0000000118420020) at ggml-backend.c:1979:12
    frame #7: 0x0000000104d09f44 llama-bench`llama_graph_compute(lctx=0x0000000114813e00, gf=0x0000000118420020, n_threads=12, threadpool=0x0000600003b6c3c0) at llama.cpp:14412:5
    frame #8: 0x0000000104c2b148 llama-bench`llama_decode_internal(lctx=0x0000000114813e00, batch_all=llama_batch @ 0x000000016b28ac60) at llama.cpp:14666:9
    frame #9: 0x0000000104c2a15c llama-bench`llama_decode(ctx=0x0000000114813e00, batch=llama_batch @ 0x000000016b28ad08) at llama.cpp:18489:21
    frame #10: 0x0000000104f3ecbc llama-bench`test_prompt(ctx=0x0000000114813e00, n_prompt=32, n_past=0, n_batch=2048, n_threads=12) at llama-bench.cpp:1319:9
    frame #11: 0x0000000104f3ae44 llama-bench`main(argc=9, argv=0x000000016b28b940) at llama-bench.cpp:1454:13
    frame #12: 0x000000018fbae0e0 dyld`start + 2360
  thread #2
    frame #0: 0x0000000104baf664 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3018:29
    frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x000000012481fe20) at ggml.c:19132:5
    frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x000000012481fe20) at ggml.c:19191:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #3
    frame #0: 0x0000000104baf690 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3019:54
    frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124820040) at ggml.c:19132:5
    frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124820040) at ggml.c:19191:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #4
    frame #0: 0x0000000104baf664 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3018:29
    frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124820260) at ggml.c:19132:5
    frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124820260) at ggml.c:19191:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #5
    frame #0: 0x0000000104baf690 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3019:54
    frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124820480) at ggml.c:19132:5
    frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124820480) at ggml.c:19191:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #6
    frame #0: 0x0000000104baf690 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3019:54
    frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x00000001248206a0) at ggml.c:19132:5
    frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001248206a0) at ggml.c:19191:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #7
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x0000000104baecf4 llama-bench`ggml_graph_compute_check_for_work(state=0x00000001248208c0) at ggml.c:19154:17
    frame #3: 0x0000000104baeaf8 llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001248208c0) at ggml.c:19189:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #8
    frame #0: 0x0000000104baf664 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3018:29
    frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124820ae0) at ggml.c:19132:5
    frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124820ae0) at ggml.c:19191:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #9
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x0000000104baecf4 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000124820d00) at ggml.c:19154:17
    frame #3: 0x0000000104baeaf8 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124820d00) at ggml.c:19189:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #10
    frame #0: 0x0000000104baf664 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3018:29
    frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124820f20) at ggml.c:19132:5
    frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124820f20) at ggml.c:19191:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #11
    frame #0: 0x0000000104baf664 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3018:29
    frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124821140) at ggml.c:19132:5
    frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124821140) at ggml.c:19191:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #12
    frame #0: 0x0000000104baf690 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3019:54
    frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124821360) at ggml.c:19132:5
    frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124821360) at ggml.c:19191:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #13
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x0000000104baecf4 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000124821580) at ggml.c:19154:17
    frame #3: 0x0000000104baeaf8 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124821580) at ggml.c:19189:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #14
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x0000000104baecf4 llama-bench`ggml_graph_compute_check_for_work(state=0x00000001248217a0) at ggml.c:19154:17
    frame #3: 0x0000000104baeaf8 llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001248217a0) at ggml.c:19189:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #15
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x0000000104baecf4 llama-bench`ggml_graph_compute_check_for_work(state=0x00000001248219c0) at ggml.c:19154:17
    frame #3: 0x0000000104baeaf8 llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001248219c0) at ggml.c:19189:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #16
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x0000000104baecf4 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000124821be0) at ggml.c:19154:17
    frame #3: 0x0000000104baeaf8 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124821be0) at ggml.c:19189:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #17
    frame #0: 0x000000018fef7ea4 libsystem_kernel.dylib`__workq_kernreturn + 8

@fmz
Contributor Author

fmz commented Jul 26, 2024

> I tried to test this on macOS, but it seems to deadlock.

Fixed!

@fmz
Contributor Author

fmz commented Jul 26, 2024

On M2 Max: (GGML_NO_METAL=1 GGML_NO_ACCELERATE=1)

| Model | Threads | Test | t/s master | t/s threadpool | Speedup |
| --- | ---: | --- | ---: | ---: | ---: |
| llama 7B Q4_0 | 4 | pp512 | 32.97 | 34.87 | 1.06 |
| llama 7B Q4_0 | 4 | tg128 | 18.01 | 18.37 | 1.02 |
| llama 7B Q4_0 | 6 | pp512 | 47.43 | 48.99 | 1.03 |
| llama 7B Q4_0 | 6 | tg128 | 23.10 | 23.32 | 1.01 |
| llama 7B Q4_0 | 8 | pp512 | 49.90 | 55.17 | 1.11 |
| llama 7B Q4_0 | 8 | tg128 | 18.09 | 21.98 | 1.22 |
| llama 7B Q4_0 | 10 | pp512 | 52.50 | 56.69 | 1.08 |
| llama 7B Q4_0 | 10 | tg128 | 14.24 | 8.54 | 0.60 |
| llama 7B Q4_0 | 12 | pp512 | 56.37 | 56.93 | 1.01 |
| llama 7B Q4_0 | 12 | tg128 | 5.02 | 9.44 | 1.88 |
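
The Speedup column appears to be the threadpool throughput divided by the master throughput, rounded to two decimals; the rows above are consistent with that reading. A small sketch of the arithmetic, with hypothetical helper names:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

// Ratio of threadpool throughput to master throughput (both in tokens/s).
double demo_speedup(double master_tps, double threadpool_tps) {
    return threadpool_tps / master_tps;
}

// Formats the ratio with two decimals, as in the Speedup column.
void demo_format_speedup(double master_tps, double threadpool_tps,
                         char *buf, size_t len) {
    snprintf(buf, len, "%.2f", demo_speedup(master_tps, threadpool_tps));
}
```

For the 10-thread tg128 row, 8.54 / 14.24 ≈ 0.60, i.e. a slowdown; for the 12-thread tg128 row, 9.44 / 5.02 ≈ 1.88.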

@fmz
Contributor Author

fmz commented Jul 26, 2024

Same thing, but with llama-v3 8B Q4_0_4_4 (for some reason my compiler AppleClang15 doesn't support INT8 matmul?)

| Model | Threads | Test | t/s master | t/s threadpool | Speedup |
| --- | ---: | --- | ---: | ---: | ---: |
| llama 8B Q4_0_4_4 | 4 | pp512 | 72.44 | 72.83 | 1.01 |
| llama 8B Q4_0_4_4 | 4 | tg128 | 22.29 | 23.50 | 1.05 |
| llama 8B Q4_0_4_4 | 6 | pp512 | 98.71 | 100.21 | 1.02 |
| llama 8B Q4_0_4_4 | 6 | tg128 | 24.63 | 24.44 | 0.99 |
| llama 8B Q4_0_4_4 | 8 | pp512 | 95.86 | 116.17 | 1.21 |
| llama 8B Q4_0_4_4 | 8 | tg128 | 21.19 | 26.28 | 1.24 |
| llama 8B Q4_0_4_4 | 10 | pp512 | 102.37 | 105.18 | 1.03 |
| llama 8B Q4_0_4_4 | 10 | tg128 | 18.63 | 16.98 | 0.91 |
| llama 8B Q4_0_4_4 | 12 | pp512 | 108.08 | 101.18 | 0.94 |
| llama 8B Q4_0_4_4 | 12 | tg128 | 6.22 | 11.39 | 1.83 |

@fmz
Contributor Author

fmz commented Jul 29, 2024

@slaren lmk if it works for you this time

@slaren
Collaborator

slaren commented Jul 31, 2024

I tested this again on the M3 Max, but it still seems to deadlock. These are the call stacks of the threads:

(lldb) bt all
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
    frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x000000013681fc00) at ggml.c:19133:5
    frame #2: 0x0000000102383190 llama-bench`ggml_graph_compute(cgraph=0x00000001085f81c8, cplan=0x000000016daa2730) at ggml.c:19374:5
    frame #3: 0x00000001023c3394 llama-bench`ggml_backend_cpu_graph_compute(backend=0x00006000025ec0b0, cgraph=0x00000001085f81c8) at ggml-backend.c:822:12
    frame #4: 0x00000001023bd840 llama-bench`ggml_backend_graph_compute_async(backend=0x00006000025ec0b0, cgraph=0x00000001085f81c8) at ggml-backend.c:282:12
    frame #5: 0x00000001023c1864 llama-bench`ggml_backend_sched_compute_splits(sched=0x000000010680c400) at ggml-backend.c:1800:35
    frame #6: 0x00000001023c15a0 llama-bench`ggml_backend_sched_graph_compute_async(sched=0x000000010680c400, graph=0x00000001081c0020) at ggml-backend.c:1987:12
    frame #7: 0x00000001024e0b58 llama-bench`llama_graph_compute(lctx=0x000000010680fa00, gf=0x00000001081c0020, n_threads=12, threadpool=0x00006000027e43c0) at llama.cpp:14425:5
    frame #8: 0x0000000102404938 llama-bench`llama_decode_internal(lctx=0x000000010680fa00, batch_all=llama_batch @ 0x000000016daa2c60) at llama.cpp:14679:9
    frame #9: 0x0000000102403a9c llama-bench`llama_decode(ctx=0x000000010680fa00, batch=llama_batch @ 0x000000016daa2d08) at llama.cpp:18499:21
    frame #10: 0x0000000102712eac llama-bench`test_prompt(ctx=0x000000010680fa00, n_prompt=32, n_past=0, n_batch=2048, n_threads=12) at llama-bench.cpp:1319:9
    frame #11: 0x000000010270f0b8 llama-bench`main(argc=9, argv=0x000000016daa3940) at llama-bench.cpp:1454:13
    frame #12: 0x000000018fbae0e0 dyld`start + 2360
  thread #2
    frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
    frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x000000013681fe20) at ggml.c:19133:5
    frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x000000013681fe20) at ggml.c:19192:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #3
    frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
    frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x0000000136820040) at ggml.c:19133:5
    frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820040) at ggml.c:19192:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #4
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136820260) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820260) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #5
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136820480) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820480) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #6
    frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
    frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x00000001368206a0) at ggml.c:19133:5
    frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001368206a0) at ggml.c:19192:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #7
    frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
    frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x00000001368208c0) at ggml.c:19133:5
    frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001368208c0) at ggml.c:19192:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #8
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136820ae0) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820ae0) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #9
    frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
    frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x0000000136820d00) at ggml.c:19133:5
    frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820d00) at ggml.c:19192:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #10
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136820f20) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820f20) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #11
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136821140) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136821140) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #12
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136821360) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136821360) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #13
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136821580) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136821580) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #14
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x00000001368217a0) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001368217a0) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #15
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x00000001368219c0) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001368219c0) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #16
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136821be0) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136821be0) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #17
    frame #0: 0x000000018fef7ea4 libsystem_kernel.dylib`__workq_kernreturn + 8

Built with LLAMA_DEBUG=1 GGML_NO_METAL=1 make llama-bench && ./llama-bench -m models/llama-2-7b/ggml-model-Q4_0.gguf -n 0 -r 1 -p 32.
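
In the backtrace above, threads 1, 2, 3, 6, 7 and 9 are inside ggml_barrier while the other workers are still blocked in pthread_cond_wait under ggml_graph_compute_check_for_work, with n_threads=12. Assuming the barrier counts arrivals against the full thread count, it cannot complete until the sleeping workers are woken. The sketch below shows that property with a generic counting spin barrier; it is an illustration with hypothetical names, not the code in ggml.c.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

// Hypothetical counting barrier; a sketch, not ggml's ggml_barrier.
typedef struct {
    atomic_int n_arrived; // threads that reached the barrier this round
    atomic_int n_passed;  // completed rounds
    int        n_threads; // arrivals required to complete a round
} demo_barrier;

void demo_barrier_wait(demo_barrier *b) {
    int round = atomic_load(&b->n_passed);
    if (atomic_fetch_add(&b->n_arrived, 1) == b->n_threads - 1) {
        // Last arrival: reset the count, then release the spinners.
        atomic_store(&b->n_arrived, 0);
        atomic_fetch_add(&b->n_passed, 1);
        return;
    }
    // Everyone else spins until the last thread shows up. If some thread
    // never arrives, this loop never exits.
    while (atomic_load(&b->n_passed) == round) {
        sched_yield(); // keeps the sketch usable with fewer cores than threads
    }
}

typedef struct {
    demo_barrier *barrier;
    atomic_int   *counter;    // total increments across all threads
    atomic_int   *violations; // times a thread passed the barrier too early
    int           rounds;
} demo_barrier_arg;

static void *demo_barrier_thread(void *p) {
    demo_barrier_arg *a = p;
    for (int r = 0; r < a->rounds; r++) {
        atomic_fetch_add(a->counter, 1);
        demo_barrier_wait(a->barrier);
        // After round r, every thread must have incremented r + 1 times.
        if (atomic_load(a->counter) < (r + 1) * a->barrier->n_threads) {
            atomic_fetch_add(a->violations, 1);
        }
    }
    return NULL;
}

// Runs `rounds` barrier rounds on `n_threads` threads (the caller is one of
// them). Returns the total increment count, or -1 on bad input or if any
// thread got through a barrier early.
int demo_barrier_run(int n_threads, int rounds) {
    if (n_threads < 1 || n_threads > 16) return -1;
    demo_barrier b;
    atomic_int counter, violations;
    atomic_init(&b.n_arrived, 0);
    atomic_init(&b.n_passed, 0);
    b.n_threads = n_threads;
    atomic_init(&counter, 0);
    atomic_init(&violations, 0);
    demo_barrier_arg arg = { &b, &counter, &violations, rounds };
    pthread_t threads[16];
    for (int i = 1; i < n_threads; i++)
        pthread_create(&threads[i], NULL, demo_barrier_thread, &arg);
    demo_barrier_thread(&arg); // calling thread participates as thread 0
    for (int i = 1; i < n_threads; i++)
        pthread_join(threads[i], NULL);
    return atomic_load(&violations) == 0 ? atomic_load(&counter) : -1;
}
```

With this kind of barrier, starting only a subset of the expected threads leaves the ones that did start spinning forever, which matches the shape of the stacks shown above.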

@fmz
Contributor Author

fmz commented Jul 31, 2024

I tested this again on the M3 Max, but it still seems to deadlock. These are the call stacks of the threads:

(lldb) bt all
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
    frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x000000013681fc00) at ggml.c:19133:5
    frame #2: 0x0000000102383190 llama-bench`ggml_graph_compute(cgraph=0x00000001085f81c8, cplan=0x000000016daa2730) at ggml.c:19374:5
    frame #3: 0x00000001023c3394 llama-bench`ggml_backend_cpu_graph_compute(backend=0x00006000025ec0b0, cgraph=0x00000001085f81c8) at ggml-backend.c:822:12
    frame #4: 0x00000001023bd840 llama-bench`ggml_backend_graph_compute_async(backend=0x00006000025ec0b0, cgraph=0x00000001085f81c8) at ggml-backend.c:282:12
    frame #5: 0x00000001023c1864 llama-bench`ggml_backend_sched_compute_splits(sched=0x000000010680c400) at ggml-backend.c:1800:35
    frame #6: 0x00000001023c15a0 llama-bench`ggml_backend_sched_graph_compute_async(sched=0x000000010680c400, graph=0x00000001081c0020) at ggml-backend.c:1987:12
    frame #7: 0x00000001024e0b58 llama-bench`llama_graph_compute(lctx=0x000000010680fa00, gf=0x00000001081c0020, n_threads=12, threadpool=0x00006000027e43c0) at llama.cpp:14425:5
    frame #8: 0x0000000102404938 llama-bench`llama_decode_internal(lctx=0x000000010680fa00, batch_all=llama_batch @ 0x000000016daa2c60) at llama.cpp:14679:9
    frame #9: 0x0000000102403a9c llama-bench`llama_decode(ctx=0x000000010680fa00, batch=llama_batch @ 0x000000016daa2d08) at llama.cpp:18499:21
    frame #10: 0x0000000102712eac llama-bench`test_prompt(ctx=0x000000010680fa00, n_prompt=32, n_past=0, n_batch=2048, n_threads=12) at llama-bench.cpp:1319:9
    frame #11: 0x000000010270f0b8 llama-bench`main(argc=9, argv=0x000000016daa3940) at llama-bench.cpp:1454:13
    frame #12: 0x000000018fbae0e0 dyld`start + 2360
  thread #2
    frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
    frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x000000013681fe20) at ggml.c:19133:5
    frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x000000013681fe20) at ggml.c:19192:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #3
    frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
    frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x0000000136820040) at ggml.c:19133:5
    frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820040) at ggml.c:19192:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #4
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136820260) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820260) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #5
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136820480) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820480) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #6
    frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
    frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x00000001368206a0) at ggml.c:19133:5
    frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001368206a0) at ggml.c:19192:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #7
    frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
    frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x00000001368208c0) at ggml.c:19133:5
    frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001368208c0) at ggml.c:19192:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #8
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136820ae0) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820ae0) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #9
    frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
    frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x0000000136820d00) at ggml.c:19133:5
    frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820d00) at ggml.c:19192:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #10
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136820f20) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820f20) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #11
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136821140) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136821140) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #12
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136821360) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136821360) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #13
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136821580) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136821580) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #14
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x00000001368217a0) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001368217a0) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #15
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x00000001368219c0) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001368219c0) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #16
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136821be0) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136821be0) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #17
    frame #0: 0x000000018fef7ea4 libsystem_kernel.dylib`__workq_kernreturn + 8

Built with LLAMA_DEBUG=1 GGML_NO_METAL=1 make llama-bench && ./llama-bench -m models/llama-2-7b/ggml-model-Q4_0.gguf -n 0 -r 1 -p 32.

Bummer...
Thanks for the details!
Looks like we have some trouble in the "ACCELERATE" path.
I'll fix it ASAP.

@fmz
Contributor Author

fmz commented Jul 31, 2024

@slaren It turns out there was a corner case: if a graph has only 1 node, ggml_barrier and wait_for_work deadlock on each other.
I added a check to handle that specific case.

@slaren
Collaborator

slaren commented Aug 1, 2024

Thanks, I was able to run it now. Unfortunately the results are still not very good on my system. Under WSL this threadpool is much slower than OpenMP. A threadpool would be more important on macOS, since OpenMP is not available there, but for me it is also slower on the M3 Max.

M3 Max:
GGML_NO_METAL=1 scripts/compare-commits.sh master threadpool -m models/llama-2-7b/ggml-model-Q4_0.gguf

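| CPU    | Model         | Model Size [GiB] | Test  | t/s master | t/s threadpool | Speedup |
|:-------|:--------------|-----------------:|:------|-----------:|---------------:|--------:|
| M3 Max | llama 7B Q4_0 |             3.56 | pp512 |     151.21 |         149.88 |    0.99 |
| M3 Max | llama 7B Q4_0 |             3.56 | tg128 |      30.06 |          26.09 |    0.87 |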

13900k + 3090Ti:
OpenMP (GGML_CUDA=1 make llama-bench && ./llama-bench -nkvo 0,1)

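| model         | size     | params | backend | ngl | nkvo | test  |             t/s |
|:--------------|---------:|-------:|:--------|----:|-----:|:------|----------------:|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA    |  99 |    0 | pp512 | 5699.53 ± 19.73 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA    |  99 |    0 | tg128 |   150.75 ± 1.23 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA    |  99 |    1 | pp512 |  651.63 ± 32.31 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA    |  99 |    1 | tg128 |    63.85 ± 3.22 |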

Threadpool (GGML_CUDA=1 GGML_NO_OPENMP=1 make llama-bench && ./llama-bench -nkvo 0,1)

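| model         | size     | params | backend | ngl | nkvo | test  |              t/s |
|:--------------|---------:|-------:|:--------|----:|-----:|:------|-----------------:|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA    |  99 |    0 | pp512 | 5453.33 ± 216.72 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA    |  99 |    0 | tg128 |    144.45 ± 0.98 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA    |  99 |    1 | pp512 |   566.43 ± 27.64 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA    |  99 |    1 | tg128 |     29.54 ± 0.99 |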

build: bebe99c (3500)

@fmz
Contributor Author

fmz commented Aug 1, 2024

Ooof...
That is quite a bit slower. I'll try to replicate this locally

@max-krasnyansky
Collaborator

max-krasnyansky commented Aug 3, 2024

@fmz @slaren
I fixed one of the issues that was causing regressions. We were setting the default number of threads in the threadpool using std::thread::hardware_concurrency(). I updated that to use cpu_get_num_math(), so we now exclude E-cores and Hyperthreading siblings.
This is what was causing regressions with the default command-line args, where the number of threads is not explicitly specified.
We were starting 12 threads on M2 Max, where only 8 cores are really usable. The same happened on AMD EPYC (using siblings) and Intel 13th/14th Gen (using E-cores).
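
To illustrate the difference between the two counts, here is a minimal sketch. The `cpu_info` struct and the `count_math_cpus`, `default_n_threads`, and `example_topology` functions are hypothetical stand-ins written for this example; they are not the actual cpu_get_num_math() implementation, which has to query the OS for the CPU topology.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-CPU description. A real implementation would fill this
// from OS topology queries rather than by hand.
struct cpu_info {
    bool is_efficiency_core; // E-core on a hybrid part
    bool is_smt_sibling;     // second hardware thread of a physical core
};

// Count only the logical CPUs worth running math-heavy threads on.
static int count_math_cpus(const std::vector<cpu_info> & cpus) {
    int n = 0;
    for (const cpu_info & c : cpus) {
        if (!c.is_efficiency_core && !c.is_smt_sibling) {
            n++;
        }
    }
    return n;
}

// Default thread count: use the filtered count, but never go below 1.
static int default_n_threads(const std::vector<cpu_info> & cpus) {
    const int n = count_math_cpus(cpus);
    return n > 0 ? n : 1;
}

// Example topology modeled on the M2 Max case from the comment above:
// 12 logical CPUs, of which 8 are performance cores.
static std::vector<cpu_info> example_topology() {
    std::vector<cpu_info> cpus(8, cpu_info{false, false});
    cpus.insert(cpus.end(), 4, cpu_info{true, false});
    return cpus;
}
```

With this topology a plain logical-CPU count would give 12, while the filtered count gives 8, which matches the numbers quoted for the M2 Max.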

I'm also working on another fix which is specific to llama-bench. Currently (in the threadpool branch) we start a single threadpool with max-num-threads and reuse it for each test. For example, a test may use only 4 threads while we start 12 (on M2 Max or Snapdragon X-Elite).
This is suboptimal because the spinning threads interfere with core boosting, among other things. It's better to start a fresh threadpool for each test.

@max-krasnyansky
Collaborator

@fmz @slaren
llama-bench has been updated as I described above.

Here are the numbers from M2 Max.
I'll share numbers for an AMD EPYC server, Snapdragon X-Elite and Gen-3 a bit later.

CC=clang CXX=clang++ CFLAGS="-march=armv8.7-a" CXXFLAGS="-march=armv8.7-a" GGML_NO_METAL=1 GGML_NO_ACCELERATE=1 \
    ./scripts/compare-commits.sh master threadpool -m ../gguf/llama-v3.1.q4_0_4_8.gguf -ngl 0 -t 4,6,8
...
+ ./scripts/compare-llama-bench.py -b master -c threadpool
| CPU   | Model             |   Threads | Test   |   t/s master |   t/s threadpool |   Speedup |
|:------|:------------------|----------:|:-------|-------------:|-----------------:|----------:|
|       | llama 8B Q4_0_4_8 |         4 | pp512  |        64.43 |            64.52 |      1.00 |
|       | llama 8B Q4_0_4_8 |         4 | tg128  |        22.53 |            24.36 |      1.08 |
|       | llama 8B Q4_0_4_8 |         6 | pp512  |        89.79 |            91.04 |      1.01 |
|       | llama 8B Q4_0_4_8 |         6 | tg128  |        24.73 |            26.21 |      1.06 |
|       | llama 8B Q4_0_4_8 |         8 | pp512  |       117.14 |           118.67 |      1.01 |
|       | llama 8B Q4_0_4_8 |         8 | tg128  |        26.11 |            26.37 |      1.01 |

@mofosyne mofosyne added the Review Complexity : Medium Generally require more time to grok but manageable by beginner to medium expertise level label Aug 6, 2024
@fmz fmz force-pushed the threadpool branch 2 times, most recently from 4aa7a72 to 8ecdd36 Compare August 7, 2024 14:38
@slaren
Collaborator

slaren commented Aug 8, 2024

The performance looks better now; with nkvo it is comparable to OpenMP, which is very good. There is still a performance drop when using the BLAS backend (this includes the default build in macOS, which uses Accelerate). I suspect that this is because the threads are spinning in ggml_graph_compute_check_for_work while the BLAS backend is running. This will also cause the threads to spin while the GPU backend is running when partially offloading, which would be a regression. Rather than requiring the user to disable polling manually, I suggest implementing some kind of backoff that yields the threads after spinning for a while. The BLAS backend (ggml-blas.cpp) may also benefit from using the threadpool, since it launches several threads to dequantize the weights, and it could also automatically pause the pool during the call to the BLAS library.
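
The backoff idea can be sketched roughly as follows: spin for a bounded number of iterations, then yield for a while, and finally block on a condition variable. This is only an illustration of the general technique. The `work_signal` struct, the `wait_for_work` and `post_work` functions, and the `spin_limit`/`yield_limit` knobs are assumptions made up for this example, not the code in ggml.c.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical work signal shared between the main thread and the workers.
struct work_signal {
    std::atomic<bool>       has_work{false};
    std::mutex              mutex;
    std::condition_variable cond;
};

// Wait for work in three phases: busy-spin, yield, then block.
// spin_limit and yield_limit are illustrative tuning knobs.
static void wait_for_work(work_signal & sig, int spin_limit = 1000, int yield_limit = 100) {
    // Phase 1: pure polling, lowest latency while work arrives quickly.
    for (int i = 0; i < spin_limit; i++) {
        if (sig.has_work.load(std::memory_order_acquire)) {
            return;
        }
    }
    // Phase 2: keep polling, but give the core to other threads between checks.
    for (int i = 0; i < yield_limit; i++) {
        if (sig.has_work.load(std::memory_order_acquire)) {
            return;
        }
        std::this_thread::yield();
    }
    // Phase 3: stop burning cycles and sleep until notified.
    std::unique_lock<std::mutex> lock(sig.mutex);
    sig.cond.wait(lock, [&] { return sig.has_work.load(std::memory_order_acquire); });
}

// Publish work. The flag is set while holding the mutex so that a waiter
// entering phase 3 cannot miss the wake-up.
static void post_work(work_signal & sig) {
    {
        std::lock_guard<std::mutex> lock(sig.mutex);
        sig.has_work.store(true, std::memory_order_release);
    }
    sig.cond.notify_all();
}
```

The trade-off is that short gaps between graphs are still covered by polling, while long gaps (for example, while a BLAS or GPU backend is running) end with the worker asleep instead of spinning.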

Results

GGML_CUDA=1 make llama-bench > /dev/null && ./llama-bench -nkvo 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes

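| model         | size     | params | backend | ngl | nkvo | test  |             t/s |
|:--------------|---------:|-------:|:--------|----:|-----:|:------|----------------:|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA    |  99 |    0 | pp512 | 5689.32 ± 13.35 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA    |  99 |    0 | tg128 |   154.53 ± 1.04 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA    |  99 |    1 | pp512 |  643.28 ± 31.69 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA    |  99 |    1 | tg128 |    64.27 ± 2.21 |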

build: 267bf57 (3554)

GGML_CUDA=1 GGML_NO_OPENMP=1 make llama-bench > /dev/null && ./llama-bench -nkvo 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes

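| model         | size     | params | backend | ngl | nkvo | test  |             t/s |
|:--------------|---------:|-------:|:--------|----:|-----:|:------|----------------:|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA    |  99 |    0 | pp512 | 5674.51 ± 37.77 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA    |  99 |    0 | tg128 |   153.30 ± 0.48 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA    |  99 |    1 | pp512 |  646.42 ± 32.41 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA    |  99 |    1 | tg128 |    62.98 ± 2.94 |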

build: 267bf57 (3554)

GGML_BLIS=1 make llama-bench > /dev/null && ./llama-bench -p 128 -n 32

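| model         | size     | params | backend | threads | test  |          t/s |
|:--------------|---------:|-------:|:--------|--------:|:------|-------------:|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | BLAS    |      16 | pp128 | 47.55 ± 0.17 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | BLAS    |      16 | tg32  | 20.79 ± 0.10 |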

build: 267bf57 (3554)

GGML_BLIS=1 GGML_NO_OPENMP=1 make llama-bench > /dev/null && ./llama-bench -p 128 -n 32

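| model         | size     | params | backend | threads | test  |          t/s |
|:--------------|---------:|-------:|:--------|--------:|:------|-------------:|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | BLAS    |      16 | pp128 | 33.47 ± 0.48 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | BLAS    |      16 | tg32  | 20.58 ± 0.07 |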

build: 267bf57 (3554)

| CPU    | Model            | Threads | Test  | t/s master | t/s threadpool | Speedup |
|:-------|:-----------------|--------:|:------|-----------:|---------------:|--------:|
| M3 Max | llama 7B all F32 |       4 | pp512 |     150.03 |         134.77 |    0.90 |
| M3 Max | llama 7B all F32 |       4 | tg128 |       4.76 |           4.20 |    0.88 |
| M3 Max | llama 7B all F32 |       8 | pp512 |     155.66 |         115.40 |    0.74 |
| M3 Max | llama 7B all F32 |       8 | tg128 |       4.76 |           4.35 |    0.91 |
| M3 Max | llama 7B all F32 |      12 | pp512 |     156.19 |          94.43 |    0.60 |
| M3 Max | llama 7B all F32 |      12 | tg128 |       4.66 |           4.33 |    0.93 |
| M3 Max | llama 7B Q4_0    |       4 | pp512 |     142.43 |         144.89 |    1.02 |
| M3 Max | llama 7B Q4_0    |       4 | tg128 |      21.04 |          20.74 |    0.99 |
| M3 Max | llama 7B Q4_0    |       8 | pp512 |     150.08 |         142.22 |    0.95 |
| M3 Max | llama 7B Q4_0    |       8 | tg128 |      28.22 |          28.14 |    1.00 |
| M3 Max | llama 7B Q4_0    |      12 | pp512 |     150.55 |         120.62 |    0.80 |
| M3 Max | llama 7B Q4_0    |      12 | tg128 |      30.10 |          30.26 |    1.01 |
| M3 Max | stories260K      |       4 | pp512 |   52491.62 |       65492.68 |    1.25 |
| M3 Max | stories260K      |       4 | tg128 |    8417.80 |       12262.68 |    1.46 |
| M3 Max | stories260K      |       8 | pp512 |   59893.07 |       94300.47 |    1.57 |
| M3 Max | stories260K      |       8 | tg128 |    3746.70 |        5639.87 |    1.51 |
| M3 Max | stories260K      |      12 | pp512 |   53756.90 |      115958.90 |    2.16 |
| M3 Max | stories260K      |      12 | tg128 |    2507.28 |        4333.34 |    1.73 |

@max-krasnyansky
Collaborator

max-krasnyansky commented Aug 9, 2024

@slaren

Awesome! Thanks for checking out the latest. We've been doing lots of profiling and tuning.
Every time I'm about to send an updated perf report on Snapdragons and M2 I find yet another thing to improve :)
In my testing we're doing really well with the CPU backend (especially on the ARM64-based systems), with other backends, as you pointed out, the spinning threads get in the way at times and cause regressions.
I'll try your suggestions.

btw We might just flip the default back to non-polling. Technically polling is only useful for the llama-bench to match OpenMP behavior/numbers in that case. When I looked at the original profiles, I saw that the threadpool is doing a lot more context switches than OpenMP during token-gen test. Polling removes those context switches and we get even better numbers now.
It might make sense to make that a bit of a special case (ie default to polling for the CPU backend bench, otherwise default is non-polling) or some hybrid approach as you suggested.

All threadpool related functions and structs use ggml_threadpool prefix.
@max-krasnyansky
Collaborator

@slaren
Most of your comments & suggestions have been addressed.
The GGML API has been further cleaned up and simplified. Threadpool switching is now transparent: we switch on ggml_backend_cpu_set_threadpool(). Threadpool params have nice defaults.

Process priority setting has been moved into a helper function in common/common.cpp and only called from the sample apps.

src/llama.cpp looks much simpler now. Just a few lines of extra code that selects the threadpool based on the
number of tokens, same as selecting n_threads. And a couple of API calls to attach threadpools, those are
just passthrough (ie used to pass threadpool to ggml_graph_compute())
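
The selection described above might look roughly like the sketch below, assuming that a single-token decode uses one pool and anything larger uses the batch pool. The `threadpool`, `context`, and `compute_choice` types and the `select_compute` function are hypothetical and exist only for this illustration; they are not the actual llama.cpp structures.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for a threadpool handle; hypothetical, for illustration only.
struct threadpool {};

// Hypothetical context holding the per-phase settings.
struct context {
    int          n_threads;       // threads for single-token decode
    int          n_threads_batch; // threads for batch (prompt) processing
    threadpool * tp;              // pool attached for single-token decode
    threadpool * tp_batch;        // pool attached for batch processing
};

// The pair handed down to the graph compute call.
struct compute_choice {
    int          n_threads;
    threadpool * tp;
};

// Pick the thread count and the threadpool with the same rule:
// one token means generation, more than one means batch processing.
static compute_choice select_compute(const context & ctx, uint32_t n_tokens) {
    if (n_tokens == 1) {
        return { ctx.n_threads, ctx.tp };
    }
    return { ctx.n_threads_batch, ctx.tp_batch };
}
```

Keeping both choices behind one condition means the thread count and the pool cannot drift apart between the two phases.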

As I mentioned above llama-bench needs to explicitly manage threadpool creation because cpu-mask and things are now vectors (ie test params) as you suggested earlier. I included some examples of the output above. It's really neat how it can be used to figure out the best CPU pinning. (unfortunately, this breaks compare-commits.sh for now because I added extra fields and sql tables don't match between branches).

Performance looks pretty good across the board, see the report below (llama-v2-115M on key platforms).
It looks like partial offload case (-ngl 10) on M2 Max is doing OK now.

We can iterate further on the automatic threadpool creation and reuse. I suggest we do that in Threadpool-V3 though, after we factor out thread/cpu/numa stuff into ggml-thread.cpp.

M2 Max

CC=clang CXX=clang++ CFLAGS="-march=armv8.7-a" CXXFLAGS="-march=armv8.7-a" make -j

(venv) ~/src/llama.cpp-master$ ./llama-bench -m ../gguf/llama-v2-115m.q4_0.gguf -t 4 -r 20 -ngl 0,10,99

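| model         | size      | params   | backend | ngl | threads | test  |                t/s |
|:--------------|----------:|---------:|:--------|----:|--------:|:------|-------------------:|
| llama ?B Q4_0 | 70.81 MiB | 116.93 M | Metal   |   0 |       4 | pp512 |    9421.03 ± 97.37 |
| llama ?B Q4_0 | 70.81 MiB | 116.93 M | Metal   |   0 |       4 | tg128 |      817.88 ± 2.58 |
| llama ?B Q4_0 | 70.81 MiB | 116.93 M | Metal   |  10 |       4 | pp512 | 46534.56 ± 1708.85 |
| llama ?B Q4_0 | 70.81 MiB | 116.93 M | Metal   |  10 |       4 | tg128 |     1167.67 ± 8.06 |
| llama ?B Q4_0 | 70.81 MiB | 116.93 M | Metal   |  99 |       4 | pp512 | 46993.95 ± 1904.04 |
| llama ?B Q4_0 | 70.81 MiB | 116.93 M | Metal   |  99 |       4 | tg128 |     1169.23 ± 9.39 |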

build: 3246fe8 (3637)

(venv) ~/src/llama.cpp-threadpool$ ./llama-bench -m ../gguf/llama-v2-115m.q4_0.gguf -t 4 -r 20 -ngl 0,10,99

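| model         | size      | params   | backend | ngl | threads | test  |                t/s |
|:--------------|----------:|---------:|:--------|----:|--------:|:------|-------------------:|
| llama ?B Q4_0 | 70.81 MiB | 116.93 M | Metal   |   0 |       4 | pp512 |    9543.05 ± 50.75 |
| llama ?B Q4_0 | 70.81 MiB | 116.93 M | Metal   |   0 |       4 | tg128 |     1003.64 ± 4.13 |
| llama ?B Q4_0 | 70.81 MiB | 116.93 M | Metal   |  10 |       4 | pp512 | 47665.38 ± 1765.27 |
| llama ?B Q4_0 | 70.81 MiB | 116.93 M | Metal   |  10 |       4 | tg128 |     1165.35 ± 8.50 |
| llama ?B Q4_0 | 70.81 MiB | 116.93 M | Metal   |  99 |       4 | pp512 | 46802.22 ± 2089.11 |
| llama ?B Q4_0 | 70.81 MiB | 116.93 M | Metal   |  99 |       4 | tg128 |     1162.56 ± 6.41 |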

build: c6328bc (3677)

Ryzen 9 3950X + RTX 3080

GGML_CUDA=1 make -j
llama.cpp-master$ ./llama-bench -m ../gguf/llama-v2-115m.q4_k_m.gguf -t 8 -ngl 0,99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes

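| model                  | size      | params   | backend | ngl | threads | test  |               t/s |
|:-----------------------|----------:|---------:|:--------|----:|--------:|:------|------------------:|
| llama ?B Q4_K - Medium | 72.74 MiB | 116.93 M | CUDA    |   0 |       8 | pp512 | 44574.85 ± 218.52 |
| llama ?B Q4_K - Medium | 72.74 MiB | 116.93 M | CUDA    |   0 |       8 | tg128 |     811.77 ± 4.67 |
| llama ?B Q4_K - Medium | 72.74 MiB | 116.93 M | CUDA    |  99 |       8 | pp512 | 144896.09 ± 446.62 |
| llama ?B Q4_K - Medium | 72.74 MiB | 116.93 M | CUDA    |  99 |       8 | tg128 |   1862.24 ± 56.18 |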

build: 3246fe8 (3637)

GGML_CUDA=1 GGML_NO_OPENMP=1 make -j
llama.cpp-threadpool$ ./llama-bench -m ../gguf/llama-v2-115m.q4_k_m.gguf -t 8 -ngl 0,99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes

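| model                  | size      | params   | backend | ngl | threads | test  |               t/s |
|:-----------------------|----------:|---------:|:--------|----:|--------:|:------|------------------:|
| llama ?B Q4_K - Medium | 72.74 MiB | 116.93 M | CUDA    |   0 |       8 | pp512 | 44386.72 ± 184.30 |
| llama ?B Q4_K - Medium | 72.74 MiB | 116.93 M | CUDA    |   0 |       8 | tg128 |     816.19 ± 3.35 |
| llama ?B Q4_K - Medium | 72.74 MiB | 116.93 M | CUDA    |  99 |       8 | pp512 | 144243.73 ± 363.10 |
| llama ?B Q4_K - Medium | 72.74 MiB | 116.93 M | CUDA    |  99 |       8 | tg128 |   1904.55 ± 64.01 |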

build: c6328bc (3677)

Snapdragon X-Elite

~/src/llama.cpp-master
$ ./build-arm64-windows-llvm-release/bin/llama-bench.exe -m ../gguf/llama-v2-115m.q4_0_4_8.gguf -t 4

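| model             | size      | params   | backend | threads | test  |             t/s |
|:------------------|----------:|---------:|:--------|--------:|:------|----------------:|
| llama ?B Q4_0_4_8 | 70.81 MiB | 116.93 M | CPU     |       4 | pp512 | 5345.36 ± 32.45 |
| llama ?B Q4_0_4_8 | 70.81 MiB | 116.93 M | CPU     |       4 | tg128 |  743.98 ± 26.26 |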

build: 3246fe8 (3637)

~/src/llama.cpp-threadpool
$ ./build-arm64-windows-llvm-release/bin/llama-bench.exe -m ../gguf/llama-v2-115m.q4_0_4_8.gguf -t 4

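| model             | size      | params   | backend | threads | test  |            t/s |
|:------------------|----------:|---------:|:--------|--------:|:------|---------------:|
| llama ?B Q4_0_4_8 | 70.81 MiB | 116.93 M | CPU     |       4 | pp512 | 5457.88 ± 4.70 |
| llama ?B Q4_0_4_8 | 70.81 MiB | 116.93 M | CPU     |       4 | tg128 | 1006.58 ± 7.88 |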

build: c6328bc (3677)

Snapdragon Gen 3

Default Android build: armv8.7-a + openmp
$ adb shell "cd /data/local/tmp/lmcp; simpleperf stat ./run-bench.sh master llama-v2-115m.q4_0_4_8.gguf -t 6"
export 'LD_LIBRARY_PATH=/data/local/tmp/lmcp/master'
./master/llama-bench --mmap 0 --n-gpu-layers 0 -m llama-v2-115m.q4_0_4_8.gguf -t 6

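| model             | size      | params   | backend | threads | mmap | test  |             t/s |
|:------------------|----------:|---------:|:--------|--------:|-----:|:------|----------------:|
| llama ?B Q4_0_4_8 | 70.81 MiB | 116.93 M | CPU     |       6 |    0 | pp512 |  3099.16 ± 2.12 |
| llama ?B Q4_0_4_8 | 70.81 MiB | 116.93 M | CPU     |       6 |    0 | tg128 | 614.70 ± 115.46 |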

build: 3246fe8 (3637)

Performance counter statistics:

#            count  event_name                # count / runtime
    38,854,765,094  cpu-cycles                # 3.014607 GHz      
       357,578,565  stalled-cycles-frontend   # 27.743 M/sec      
    16,329,331,107  stalled-cycles-backend    # 1.267 G/sec       
    85,859,994,043  instructions              # 6.662 G/sec       
        11,675,349  branch-misses             # 905.850 K/sec     
  12888.605049(ms)  task-clock                # 5.478707 cpus used
               934  context-switches          # 72.467 /sec       
             8,267  page-faults               # 641.419 /sec      

Default Android build: armv8.7-a + no-openmp
$ adb shell "cd /data/local/tmp/lmcp; simpleperf stat ./run-bench.sh threadpool llama-v2-115m.q4_0_4_8.gguf -t 6"
export 'LD_LIBRARY_PATH=/data/local/tmp/lmcp/threadpool'
./threadpool/llama-bench --mmap 0 --n-gpu-layers 0 -m llama-v2-115m.q4_0_4_8.gguf -t 6

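| model             | size      | params   | backend | threads | mmap | test  |            t/s |
|:------------------|----------:|---------:|:--------|--------:|-----:|:------|---------------:|
| llama ?B Q4_0_4_8 | 70.81 MiB | 116.93 M | CPU     |       6 |    0 | pp512 | 3108.87 ± 5.66 |
| llama ?B Q4_0_4_8 | 70.81 MiB | 116.93 M | CPU     |       6 |    0 | tg128 | 750.33 ± 11.71 |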

build: c6328bc (3677)

Performance counter statistics:

#            count  event_name                # count / runtime
    34,793,228,120  cpu-cycles                # 3.012136 GHz      
       325,677,417  stalled-cycles-frontend   # 28.195 M/sec      
    12,701,547,441  stalled-cycles-backend    # 1.100 G/sec       
    80,139,234,122  instructions              # 6.938 G/sec       
         8,967,312  branch-misses             # 776.323 K/sec     
  11550.896083(ms)  task-clock                # 5.611216 cpus used
               226  context-switches          # 19.566 /sec       
             7,976  page-faults               # 690.509 /sec      

Comment on lines +630 to +635
enum ggml_sched_priority {
    GGML_SCHED_PRIO_NORMAL,
    GGML_SCHED_PRIO_MEDIUM,
    GGML_SCHED_PRIO_HIGH,
    GGML_SCHED_PRIO_REALTIME
};
Collaborator

Doesn't need to be done now, but it would be useful to have priorities below normal. I don't expect that increasing the priority of compute threads will be very useful outside of benchmarking; virtually every other thread is more important.

Collaborator

Sounds good. The main use cases we wanted to enable are benchmarking and low-latency LLM response (i.e. using fewer cores but having the threads get CPU cycles quickly). Bumping the priority a bit also encourages the Windows scheduler to place threads on the perf cores.
Will add lower priorities in threadpool V3.
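
To make the suggestion concrete, here is a rough sketch of what a below-normal level could look like. Everything in it is an assumption for illustration: the enum `sched_prio_sketch` is not the ggml enum, the `low` level is hypothetical, and the mapping to POSIX nice values (range -20..19, where larger means less CPU priority) uses arbitrary numbers rather than anything this PR implements.

```cpp
#include <cassert>

// Illustrative only: NOT the ggml enum. A hypothetical below-normal level
// is placed in front of the four levels discussed in the review.
enum class sched_prio_sketch { low = -1, normal = 0, medium, high, realtime };

// Maps a level to a POSIX nice value (range -20..19, larger = lower priority).
// The specific numbers are assumptions chosen for the example.
int nice_value_for(sched_prio_sketch p) {
    switch (p) {
        case sched_prio_sketch::low:      return 10;
        case sched_prio_sketch::normal:   return 0;
        case sched_prio_sketch::medium:   return -5;
        case sched_prio_sketch::high:     return -10;
        case sched_prio_sketch::realtime: return -20;
    }
    assert(false && "unknown priority level");
    return 0;
}
```

Keeping `normal` at zero means existing callers that never set a priority are unaffected, and a background workload could opt into `low` so that it yields to every other thread.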

@max-krasnyansky
Collaborator

@mofosyne @ggerganov @slaren
Should be good to go now.

@slaren slaren merged commit 42c76d1 into ggerganov:master Aug 29, 2024
52 checks passed
@slaren
Collaborator

slaren commented Aug 29, 2024

Good job!

@max-krasnyansky
Collaborator

Good job!

Thank you thank you!
Super fun discussions. Thanks for your patience with reviews and testing.
I'm going to get started on the V3 :) std::thread, std::atomic will make the code even better.

@ggerganov
Owner

Thank you for the great work and thorough review 👍

@FranzKafkaYu

@slaren Most of your comments & suggestions have been addressed. GGML API has been further cleaned up and simplified. Threadpool switching is now transparent, we switch on ggml_backend_cpu_set_threadpool(). Threadpool params have nice defaults.

Process priority setting has been moved into a helper function in common/common.cpp and only called from the sample apps.

src/llama.cpp looks much simpler now. Just a few lines of extra code that selects the threadpool based on the number of tokens, same as selecting n_threads. And a couple of API calls to attach threadpools, those are just passthrough (ie used to pass threadpool to ggml_graph_compute())
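
The selection described above can be sketched roughly as follows. The types `threadpool_sketch`, `context_sketch` and `compute_choice`, and the function `choose_for_tokens`, are stand-ins invented for this example so that it compiles on its own; the real code works on llama.cpp and ggml types. A null pool stands for the case where no threadpool is attached and a disposable one is created instead.

```cpp
#include <cassert>

// Stand-in types for the sketch; the real ones live in ggml / llama.cpp.
struct threadpool_sketch { int n_threads; };

struct context_sketch {
    int n_threads;                        // threads for single-token decode
    int n_threads_batch;                  // threads for batch / prompt processing
    threadpool_sketch * threadpool;       // may be null: no pool attached
    threadpool_sketch * threadpool_batch; // may be null: no pool attached
};

struct compute_choice {
    int n_threads;
    threadpool_sketch * threadpool; // null: fall back to a disposable pool
};

// Picks thread count and pool with the same rule: one token means generation,
// more than one means batch processing.
compute_choice choose_for_tokens(const context_sketch & ctx, int n_tokens) {
    assert(n_tokens > 0);
    const bool single = (n_tokens == 1);
    return { single ? ctx.n_threads  : ctx.n_threads_batch,
             single ? ctx.threadpool : ctx.threadpool_batch };
}
```

Because the pool is chosen by the same condition as the thread count, the two can never disagree about which phase is running.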

As I mentioned above llama-bench needs to explicitly manage threadpool creation because cpu-mask and things are now vectors (ie test params) as you suggested earlier. I included some examples of the output above. It's really neat how it can be used to figure out the best CPU pinning. (unfortunately, this breaks compare-commits.sh for now because I added extra fields and sql tables don't match between branches).

Performance looks pretty good across the board, see the report below (llama-v2-115M on key platforms). It looks like partial offload case (-ngl 10) on M2 Max is doing OK now.

We can iterate further on the automatic threadpool creation and reuse. I suggest we do that in Threadpool-V3 though, after we factor out thread/cpu/numa stuff into ggml-thread.cpp.

M2 Max

CC=clang CXX=clang++ CFLAGS="-march=armv8.7-a" CXXFLAGS="-march=armv8.7-a" make -j

(venv) ~/src/llama.cpp-master$ ./llama-bench -m ../gguf/llama-v2-115m.q4_0.gguf -t 4 -r 20 -ngl 0,10,99

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama ?B Q4_0 | 70.81 MiB | 116.93 M | Metal | 0 | 4 | pp512 | 9421.03 ± 97.37 |
| llama ?B Q4_0 | 70.81 MiB | 116.93 M | Metal | 0 | 4 | tg128 | 817.88 ± 2.58 |
| llama ?B Q4_0 | 70.81 MiB | 116.93 M | Metal | 10 | 4 | pp512 | 46534.56 ± 1708.85 |
| llama ?B Q4_0 | 70.81 MiB | 116.93 M | Metal | 10 | 4 | tg128 | 1167.67 ± 8.06 |
| llama ?B Q4_0 | 70.81 MiB | 116.93 M | Metal | 99 | 4 | pp512 | 46993.95 ± 1904.04 |
| llama ?B Q4_0 | 70.81 MiB | 116.93 M | Metal | 99 | 4 | tg128 | 1169.23 ± 9.39 |

build: 3246fe8 (3637)

(venv) ~/src/llama.cpp-threadpool$ ./llama-bench -m ../gguf/llama-v2-115m.q4_0.gguf -t 4 -r 20 -ngl 0,10,99

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama ?B Q4_0 | 70.81 MiB | 116.93 M | Metal | 0 | 4 | pp512 | 9543.05 ± 50.75 |
| llama ?B Q4_0 | 70.81 MiB | 116.93 M | Metal | 0 | 4 | tg128 | 1003.64 ± 4.13 |
| llama ?B Q4_0 | 70.81 MiB | 116.93 M | Metal | 10 | 4 | pp512 | 47665.38 ± 1765.27 |
| llama ?B Q4_0 | 70.81 MiB | 116.93 M | Metal | 10 | 4 | tg128 | 1165.35 ± 8.50 |
| llama ?B Q4_0 | 70.81 MiB | 116.93 M | Metal | 99 | 4 | pp512 | 46802.22 ± 2089.11 |
| llama ?B Q4_0 | 70.81 MiB | 116.93 M | Metal | 99 | 4 | tg128 | 1162.56 ± 6.41 |

build: c6328bc (3677)

Ryzen 9 3950X + RTX 3080

GGML_CUDA=1 make -j
llama.cpp-master$ ./llama-bench -m ../gguf/llama-v2-115m.q4_k_m.gguf -t 8 -ngl 0,99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama ?B Q4_K - Medium | 72.74 MiB | 116.93 M | CUDA | 0 | 8 | pp512 | 44574.85 ± 218.52 |
| llama ?B Q4_K - Medium | 72.74 MiB | 116.93 M | CUDA | 0 | 8 | tg128 | 811.77 ± 4.67 |
| llama ?B Q4_K - Medium | 72.74 MiB | 116.93 M | CUDA | 99 | 8 | pp512 | 144896.09 ± 446.62 |
| llama ?B Q4_K - Medium | 72.74 MiB | 116.93 M | CUDA | 99 | 8 | tg128 | 1862.24 ± 56.18 |

build: 3246fe8 (3637)

GGML_CUDA=1 GGML_NO_OPENMP=1 make -j
llama.cpp-threadpool$ ./llama-bench -m ../gguf/llama-v2-115m.q4_k_m.gguf -t 8 -ngl 0,99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama ?B Q4_K - Medium | 72.74 MiB | 116.93 M | CUDA | 0 | 8 | pp512 | 44386.72 ± 184.30 |
| llama ?B Q4_K - Medium | 72.74 MiB | 116.93 M | CUDA | 0 | 8 | tg128 | 816.19 ± 3.35 |
| llama ?B Q4_K - Medium | 72.74 MiB | 116.93 M | CUDA | 99 | 8 | pp512 | 144243.73 ± 363.10 |
| llama ?B Q4_K - Medium | 72.74 MiB | 116.93 M | CUDA | 99 | 8 | tg128 | 1904.55 ± 64.01 |

build: c6328bc (3677)

Snapdragon X-Elite

~/src/llama.cpp-master $ ./build-arm64-windows-llvm-release/bin/llama-bench.exe -m ../gguf/llama-v2-115m.q4_0_4_8.gguf -t 4

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama ?B Q4_0_4_8 | 70.81 MiB | 116.93 M | CPU | 4 | pp512 | 5345.36 ± 32.45 |
| llama ?B Q4_0_4_8 | 70.81 MiB | 116.93 M | CPU | 4 | tg128 | 743.98 ± 26.26 |

build: 3246fe8 (3637)

~/src/llama.cpp-threadpool $ ./build-arm64-windows-llvm-release/bin/llama-bench.exe -m ../gguf/llama-v2-115m.q4_0_4_8.gguf -t 4

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama ?B Q4_0_4_8 | 70.81 MiB | 116.93 M | CPU | 4 | pp512 | 5457.88 ± 4.70 |
| llama ?B Q4_0_4_8 | 70.81 MiB | 116.93 M | CPU | 4 | tg128 | 1006.58 ± 7.88 |

build: c6328bc (3677)

Snapdragon Gen 3

Default Android build: armv8.7-a + openmp
$ adb shell "cd /data/local/tmp/lmcp; simpleperf stat ./run-bench.sh master llama-v2-115m.q4_0_4_8.gguf -t 6"
export 'LD_LIBRARY_PATH=/data/local/tmp/lmcp/master'
./master/llama-bench --mmap 0 --n-gpu-layers 0 -m llama-v2-115m.q4_0_4_8.gguf -t 6

| model | size | params | backend | threads | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama ?B Q4_0_4_8 | 70.81 MiB | 116.93 M | CPU | 6 | 0 | pp512 | 3099.16 ± 2.12 |
| llama ?B Q4_0_4_8 | 70.81 MiB | 116.93 M | CPU | 6 | 0 | tg128 | 614.70 ± 115.46 |

build: 3246fe8 (3637)

Performance counter statistics:

#            count  event_name                # count / runtime
    38,854,765,094  cpu-cycles                # 3.014607 GHz      
       357,578,565  stalled-cycles-frontend   # 27.743 M/sec      
    16,329,331,107  stalled-cycles-backend    # 1.267 G/sec       
    85,859,994,043  instructions              # 6.662 G/sec       
        11,675,349  branch-misses             # 905.850 K/sec     
  12888.605049(ms)  task-clock                # 5.478707 cpus used
               934  context-switches          # 72.467 /sec       
             8,267  page-faults               # 641.419 /sec      

Default Android build: armv8.7-a + no-openmp
$ adb shell "cd /data/local/tmp/lmcp; simpleperf stat ./run-bench.sh threadpool llama-v2-115m.q4_0_4_8.gguf -t 6"
export 'LD_LIBRARY_PATH=/data/local/tmp/lmcp/threadpool'
./threadpool/llama-bench --mmap 0 --n-gpu-layers 0 -m llama-v2-115m.q4_0_4_8.gguf -t 6

| model | size | params | backend | threads | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama ?B Q4_0_4_8 | 70.81 MiB | 116.93 M | CPU | 6 | 0 | pp512 | 3108.87 ± 5.66 |
| llama ?B Q4_0_4_8 | 70.81 MiB | 116.93 M | CPU | 6 | 0 | tg128 | 750.33 ± 11.71 |

build: c6328bc (3677)

Performance counter statistics:

#            count  event_name                # count / runtime
    34,793,228,120  cpu-cycles                # 3.012136 GHz      
       325,677,417  stalled-cycles-frontend   # 28.195 M/sec      
    12,701,547,441  stalled-cycles-backend    # 1.100 G/sec       
    80,139,234,122  instructions              # 6.938 G/sec       
         8,967,312  branch-misses             # 776.323 K/sec     
  11550.896083(ms)  task-clock                # 5.611216 cpus used
               226  context-switches          # 19.566 /sec       
             7,976  page-faults               # 690.509 /sec      

Excellent work! May I ask whether you have tried Android on an ARM board with the CPU backend? What is the performance?

@akx akx mentioned this pull request Sep 6, 2024
4 tasks
@fmz
Contributor Author

fmz commented Sep 6, 2024

That's the Snapdragon 8 Gen 3 :)

dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
* Introduce ggml_compute_threadpool

- OpenMP functional: check
- Vanilla ggml functional: Check
- ggml w/threadpool functional: Check
- OpenMP no regression: No glaring problems
- Vanilla ggml no regression: No glaring problems
- ggml w/threadpool no regression: No glaring problems

* Minor fixes

* fixed use after release bug

* fixed a harmless race condition

* Fix Android build issue

* fix more race conditions

* fix deadlock for cases where cgraph.n_nodes == 1

and fix --poll case

* threadpool: use cpu_get_num_math to set the default number of threadpool threads

This way we avoid using E-Cores and Hyperthreaded siblings.

* bench: create fresh threadpool for each test

For benchmarking it's better to start a fresh pool for each test with the exact number of threads
needed for that test. Having larger pools is suboptimal (causes more load, etc).

* atomics: always use stdatomics with clang and use relaxed memory order when polling in ggml_barrier

This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior.

* threadpool: make polling the default to match openmp behavior

All command line args now allow for setting poll to 0 (false).

* threadpool: do not wakeup threads in already paused threadpool

* fix potential race condition in check_for_work

* threadpool: do not create two threadpools if their params are identical

* threadpool: reduce pause/resume/wakeup overhead in common cases

We now start threadpool in paused state only if we have two.
The resume is now implicit (ie new work) which allows for reduced locking and context-switch overhead.

* threadpool: add support for hybrid polling

poll params (--poll, ...) now specify "polling level", i.e. how aggressively we poll before waiting on cond.var.
poll=0 means no polling, 1 means poll for 128K rounds then wait, 2 for 256K rounds, ...

The default value of 50 (ie 50x128K rounds) seems like a decent default across modern platforms.
We can tune this further as things evolve.
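
A minimal sketch of the hybrid scheme described in this commit message, assuming a single ready flag: spin for `poll_level` x 128K rounds, then fall back to a condition variable. The names `wait_state`, `poll_rounds`, `hybrid_wait` and `signal_ready` are invented for the example and are not taken from ggml.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Sketch state shared by the producer and a waiting worker (illustrative names).
struct wait_state {
    std::atomic<bool>       ready{false};
    std::mutex              mtx;
    std::condition_variable cv;
};

// Spin budget for a polling level: level N polls for N x 128K rounds.
uint64_t poll_rounds(uint32_t poll_level) {
    return static_cast<uint64_t>(poll_level) * 128u * 1024u;
}

// Returns true if the flag was seen during the polling phase,
// false if the call fell through to the condition-variable path.
bool hybrid_wait(wait_state & s, uint32_t poll_level) {
    const uint64_t rounds = poll_rounds(poll_level);
    for (uint64_t i = 0; i < rounds; ++i) {
        if (s.ready.load(std::memory_order_acquire)) {
            return true;
        }
    }
    std::unique_lock<std::mutex> lock(s.mtx);
    s.cv.wait(lock, [&] { return s.ready.load(std::memory_order_acquire); });
    return false;
}

// Producer side: publish under the mutex so a blocked waiter cannot miss the wakeup.
void signal_ready(wait_state & s) {
    {
        std::lock_guard<std::mutex> lock(s.mtx);
        s.ready.store(true, std::memory_order_release);
    }
    s.cv.notify_all();
}
```

With `poll_level` set to 0 the loop is skipped entirely and the worker goes straight to the condition variable, which matches the "poll=0 means no polling" behaviour described above.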

* threadpool: reduce the number of barrier required

New work is now indicated with an atomic counter that is incremented for
each new graph that needs to be computed.
This removes the need for extra barrier for clearing the "new_work" and
removes the special case for trivial graphs.
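
The counter idea can be illustrated with a small sketch. It assumes each worker remembers the last counter value it acted on; the names `pool_sketch`, `worker_sketch`, `kick_off_graph` and `has_new_work` are hypothetical and only loosely modelled on the description above.

```cpp
#include <atomic>
#include <cassert>

// Illustrative pool state: one counter, bumped once per submitted graph.
struct pool_sketch {
    std::atomic<int> n_graph{0};
};

// Per-worker bookkeeping: the last counter value this worker acted on.
struct worker_sketch {
    int last_graph = 0;
};

// Main thread: announce a new graph. There is no flag to reset afterwards,
// so no extra barrier is needed to clear it.
void kick_off_graph(pool_sketch & pool) {
    pool.n_graph.fetch_add(1, std::memory_order_release);
}

// Worker: returns true once for each new counter value it observes.
bool has_new_work(const pool_sketch & pool, worker_sketch & w) {
    const int current = pool.n_graph.load(std::memory_order_acquire);
    if (current != w.last_graph) {
        w.last_graph = current;
        return true;
    }
    return false;
}
```

The sketch relies on the main thread waiting for one graph to finish before submitting the next; if two graphs were submitted back to back, a worker would observe only one change.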

* threadpool: remove special-casing for disposable threadpools

With the efficient hybrid polling there is no need to make disposable pools any different.
This simplifies the overall logic and reduces branching.

Include n_threads in debug print for disposable threadpool.

Declare pause and stop flags as atomic_bool
This doesn't actually generate any memory barriers and simply informs
the thread sanitizer that these flags can be written & read by different
threads without locking.

* threadpool: do not clear barrier counters between graphs computes (fixes race with small graphs)

This fixes the race condition with very small graphs where the main thread happens to
start a new graph while the workers are just about to exit from barriers.

* threadpool: use relaxed order for chunk sync

Full memory barrier is an overkill for this since each thread works on different chunk
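
As a rough illustration of why relaxed ordering suffices here, the sketch below has workers claim chunk indices from a shared counter. The counter only hands out unique indices and publishes no data, so `memory_order_relaxed` is enough; the final results are synchronized by `join()`. The function `chunked_sum` is invented for this example and is not ggml code.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Sums `data` with `n_threads` workers that claim fixed-size chunks
// from a shared atomic counter.
int64_t chunked_sum(const std::vector<int> & data, int n_threads, std::size_t chunk_size) {
    assert(n_threads > 0 && chunk_size > 0);
    const std::size_t n_chunks = (data.size() + chunk_size - 1) / chunk_size;
    std::atomic<std::size_t> next_chunk{0};
    std::vector<int64_t> partial(static_cast<std::size_t>(n_threads), 0);

    auto worker = [&](int ith) {
        for (;;) {
            // Relaxed: each thread only needs a unique index, not ordering
            // with respect to other memory operations.
            const std::size_t c = next_chunk.fetch_add(1, std::memory_order_relaxed);
            if (c >= n_chunks) break;
            const std::size_t begin = c * chunk_size;
            const std::size_t end   = std::min(begin + chunk_size, data.size());
            for (std::size_t i = begin; i < end; ++i) {
                partial[static_cast<std::size_t>(ith)] += data[i];
            }
        }
    };

    std::vector<std::thread> threads;
    for (int t = 0; t < n_threads; ++t) threads.emplace_back(worker, t);
    for (auto & th : threads) th.join(); // join() makes the partial sums visible

    int64_t total = 0;
    for (int64_t p : partial) total += p;
    return total;
}
```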

* threadpool: remove abort_callback from threadpool state

* threadpool: better naming for thread/cpumask related functions

* threadpool: consistent use of int type for n_threads params

* threadpool: add support for ggml_threadpool_params_default/init

Also removes the need for explicit mask_specified param.
all-zero cpumask means use default (usually inherited) cpu affinity mask.
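
The "all-zero means default" convention can be sketched as below. The bound `kMaxThreads`, the alias `cpumask_sketch` and both helpers are assumptions made for the example; they are not ggml's limit or API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kMaxThreads = 16; // small illustrative bound, not ggml's limit

using cpumask_sketch = std::array<bool, kMaxThreads>;

// An all-zero mask carries no placement request: keep the inherited affinity.
bool mask_is_default(const cpumask_sketch & mask) {
    for (bool bit : mask) {
        if (bit) return false;
    }
    return true;
}

// Builds a mask from a bit pattern, with bit i selecting CPU i (hypothetical helper).
cpumask_sketch mask_from_bits(unsigned long long bits) {
    cpumask_sketch mask{};
    for (std::size_t i = 0; i < kMaxThreads; ++i) {
        mask[i] = ((bits >> i) & 1ULL) != 0;
    }
    return mask;
}
```

Treating the empty mask as "use the default" removes the need for a separate flag that says whether a mask was specified, since a mask selecting no CPUs would be meaningless anyway.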

* threadpool: move typedef into ggml.h

* threadpool: fix apply_priority() function name

* threadpool: fix swift wrapper errors due to n_threads int type cleanup

* threadpool: enable --cpu-mask and other threadpool related options only if threadpool is enabled

* threadpool: replace checks for compute_thread ret code with proper status check

* threadpool: simplify threadpool init logic and fix main thread affinity application

Most of the init code is now exactly the same between threadpool and openmp.

* threadpool: update threadpool resume/pause function names

* threadpool: enable openmp by default for now

* threadpool: don't forget to free workers state when omp is enabled

* threadpool: avoid updating process priority on the platforms that do not require it

On Windows we need to change overall process priority class in order to set thread priorities,
but on Linux, Mac, etc we do not need to touch the overall process settings.

* threadpool: update calling thread prio and affinity only at start/resume

This avoids extra syscalls for each graph_compute()

* llama-bench: turn threadpool params into vectors, add output headers, etc

* llama-bench: add support for cool off between tests --delay

This helps for long running tests on platforms that are thermally limited (phones, laptops, etc).
--delay (disabled by default) introduces the sleep for N seconds before starting each test.

* threadpool: move process priority setting into the apps (bench and cli)

This avoids changing the overall process priority on Windows for the apps
that use ggml/llama.cpp directly.

* threadpool: move all pause/resume logic into ggml

* threadpool: further api cleanup and prep for future refactoring

All threadpool related functions and structs use ggml_threadpool prefix.

* threadpool: minor indent fixes

* threadpool: improve setpriority error message

* Update examples/llama-bench/llama-bench.cpp

Co-authored-by: slaren <[email protected]>

* threadpool: fix indent in set_threadpool call

* use int32_t for n_thread type in public llama.cpp API

* threadpool: use _new and _free instead of _create and _release

* fix two more public APIs to use int32_t for n_threads

* build: set _GNU_SOURCE for Android

---------

Co-authored-by: Max Krasnyansky <[email protected]>
Co-authored-by: fmz <[email protected]>
Co-authored-by: Max Krasnyansky <[email protected]>
Co-authored-by: slaren <[email protected]>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
* Introduce ggml_compute_threadpool

- OpenMP functional: check
- Vanilla ggml functional: Check
- ggml w/threadpool functional: Check
- OpenMP no regression: No glaring problems
- Vanilla ggml no regression: No glaring problems
- ggml w/threadpool no regression: No glaring problems

* Minor fixes

* fixed use after release bug

* fixed a harmless race condition

* Fix Android build issue

* fix more race conditions

* fix deadlock for cases where cgraph.n_nodes == 1

and fix --poll case

* threadpool: use cpu_get_num_math to set the default number of threadpool threads

This way we avoid using E-Cores and Hyperthreaded siblings.

* bench: create fresh threadpool for each test

For benchmarking it's better to start a fresh pool for each test with the exact number of threads
needed for that test. Having larger pools is suboptimal (causes more load, etc).

* atomics: always use stdatomics with clang and use relaxed memory order when polling in ggml_barrier

This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior.

* threadpool: make polling the default to match openmp behavior

All command line args now allow for setting poll to 0 (false).

* threadpool: do not wakeup threads in already paused threadpool

* fix potential race condition in check_for_work

* threadpool: do not create two threadpools if their params are identical

* threadpool: reduce pause/resume/wakeup overhead in common cases

We now start threadpool in paused state only if we have two.
The resume is now implicit (ie new work) which allows for reduced locking and context-switch overhead.

* threadpool: add support for hybrid polling

poll params (--poll, ...) now specify "polling level", i.e. how aggressively we poll before waiting on cond.var.
poll=0 means no polling, 1 means poll for 128K rounds then wait, 2 for 256K rounds, ...

The default value of 50 (ie 50x128K rounds) seems like a decent default across modern platforms.
We can tune this further as things evolve.

* threadpool: reduce the number of barrier required

New work is now indicated with an atomic counter that is incremented for
each new graph that needs to be computed.
This removes the need for extra barrier for clearing the "new_work" and
removes the special case for trivial graphs.

* threadpool: remove special-casing for disposable threadpools

With the efficient hybrid polling there is no need to make disposable pools any different.
This simplifies the overall logic and reduces branching.

Include n_threads in debug print for disposable threadpool.

Declare pause and stop flags as atomic_bool
This doesn't actually generate any memory barriers and simply informs
the thread sanitizer that these flags can be written & read by different
threads without locking.

* threadpool: do not clear barrier counters between graphs computes (fixes race with small graphs)

This fixes the race condition with very small graphs where the main thread happens to
start a new graph while the workers are just about to exit from barriers.

* threadpool: use relaxed order for chunk sync

Full memory barrier is an overkill for this since each thread works on different chunk

* threadpool: remove abort_callback from threadpool state

* threadpool: better naming for thread/cpumask related functions

* threadpool: consistent use of int type for n_threads params

* threadpool: add support for ggml_threadpool_params_default/init

Also removes the need for explicit mask_specified param.
all-zero cpumask means use default (usually inherited) cpu affinity mask.

* threadpool: move typedef into ggml.h

* threadpool: fix apply_priority() function name

* threadpool: fix swift wrapper errors due to n_threads int type cleanup

* threadpool: enable --cpu-mask and other threadpool related options only if threadpool is enabled

* threadpool: replace checks for compute_thread ret code with proper status check

* threadpool: simplify threadpool init logic and fix main thread affinity application

Most of the init code is now exactly the same between threadpool and openmp.

* threadpool: update threadpool resume/pause function names

* threadpool: enable openmp by default for now

* threadpool: don't forget to free workers state when omp is enabled

* threadpool: avoid updating process priority on the platforms that do not require it

On Windows we need to change the overall process priority class in order to set thread priorities,
but on Linux, Mac, etc. we do not need to touch the overall process settings.

* threadpool: update calling thread prio and affinity only at start/resume

This avoids extra syscalls for each graph_compute()

* llama-bench: turn threadpool params into vectors, add output headers, etc

* llama-bench: add support for cool-off between tests (--delay)

This helps with long-running tests on thermally limited platforms (phones, laptops, etc).
--delay (disabled by default) sleeps for N seconds before starting each test.

* threadpool: move process priority setting into the apps (bench and cli)

This avoids changing the overall process priority on Windows for the apps
that use ggml/llama.cpp directly.

* threadpool: move all pause/resume logic into ggml

* threadpool: further api cleanup and prep for future refactoring

All threadpool related functions and structs use ggml_threadpool prefix.

* threadpool: minor indent fixes

* threadpool: improve setpriority error message

* Update examples/llama-bench/llama-bench.cpp

Co-authored-by: slaren <[email protected]>

* threadpool: fix indent in set_threadpool call

* use int32_t for n_thread type in public llama.cpp API

* threadpool: use _new and _free instead of _create and _release

* fix two more public APIs to use int32_t for n_threads

* build: set _GNU_SOURCE for Android

---------

Co-authored-by: Max Krasnyansky <[email protected]>
Co-authored-by: fmz <[email protected]>
Co-authored-by: Max Krasnyansky <[email protected]>
Co-authored-by: slaren <[email protected]>
Labels
examples; ggml (changes relating to the ggml tensor library for machine learning); Review Complexity : Medium (generally requires more time to grok but manageable by beginner to medium expertise level); server; testing (everything test related)
8 participants