
ggml : make n_threads_cur atomic_int #9441

Closed
wants to merge 1 commit into from

Conversation

ggerganov
Owner

On Mac without OpenMP, I get the following thread sanitizer error:

make -j && ./bin/llama-cli --model ../models/tinyllama-1b/ggml-model-f16.gguf -p "Hello"
llama_new_context_with_model:        CPU compute buffer size =     8.01 MiB
llama_new_context_with_model: graph nodes  = 710
llama_new_context_with_model: graph splits = 2

system_info: n_threads = 16 (n_threads_batch = 16) / 24 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling seed: 2090654531
sampling params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler constr: 
	logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist 
generate: n_ctx = 2048, n_batch = 2048, n_predict = -1, n_keep = 1


 Hello==================
WARNING: ThreadSanitizer: data race (pid=27473)
  Read of size 4 at 0x000108a3c0a4 by thread T23:
    #0 ggml_graph_compute_ready ggml.c:19932 (libggml.dylib:arm64+0xf3f3c)
    #1 ggml_graph_compute_poll_for_work ggml.c:19946 (libggml.dylib:arm64+0xf3890)
    #2 ggml_graph_compute_check_for_work ggml.c:19957 (libggml.dylib:arm64+0xf33a8)
    #3 ggml_graph_compute_secondary_thread ggml.c:19999 (libggml.dylib:arm64+0xf2d24)

  Previous write of size 4 at 0x000108a3c0a4 by main thread:
    #0 ggml_graph_compute ggml.c:20151 (libggml.dylib:arm64+0x98e20)
    #1 ggml_backend_cpu_graph_compute ggml-backend.c:817 (libggml.dylib:arm64+0x296098)
    #2 ggml_backend_graph_compute_async ggml-backend.c:282 (libggml.dylib:arm64+0x269a68)
    #3 ggml_backend_sched_compute_splits ggml-backend.c:1818 (libggml.dylib:arm64+0x28b57c)
    #4 ggml_backend_sched_graph_compute_async ggml-backend.c:2006 (libggml.dylib:arm64+0x288c14)
    #5 llama_graph_compute(llama_context&, ggml_cgraph*, int, ggml_threadpool*) llama.cpp:16053 (libllama.dylib:arm64+0xff0e4)

  Location is heap block of size 184 at 0x000108a3c000 allocated by main thread:
    #0 posix_memalign <null>:53366404 (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x564c0)
    #1 ggml_aligned_malloc ggml.c:342 (libggml.dylib:arm64+0x10480)
    #2 ggml_threadpool_new_impl ggml.c:20066 (libggml.dylib:arm64+0x967b0)
    #3 ggml_threadpool_new ggml.c:20127 (libggml.dylib:arm64+0x9674c)
    #4 main main.cpp:243 (llama-cli:arm64+0x100005578)

  Thread T23 (tid=263822252, running) created by main thread at:
    #0 pthread_create <null>:53366404 (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x3062c)
    #1 ggml_threadpool_new_impl ggml.c:20108 (libggml.dylib:arm64+0x9802c)
    #2 ggml_threadpool_new ggml.c:20127 (libggml.dylib:arm64+0x9674c)
    #3 main main.cpp:243 (llama-cli:arm64+0x100005578)

SUMMARY: ThreadSanitizer: data race ggml.c:19932 in ggml_graph_compute_ready
==================
. I'm doing a course with a group of 20 students in a group of 5. I want to do this for 2 hours a week, so I have to give them a lesson every Monday and Wednesday. I have 5 students. I have to find a teacher for all of them. I would like to have an answer at the end of the semester.

For the moment I want to know who are the 20 students of this group, and what is the frequency of their participation.

I'd like to know the number of students that took part in 2 hours lessons every week

I'd like to know the number of students
llama_perf_print:    sampling time =      14.57 ms /   148 runs   (    0.10 ms per token, 10156.46 tokens per second)
llama_perf_print:        load time =     250.03 ms
llama_perf_print: prompt eval time =     377.17 ms /     2 tokens (  188.58 ms per token,     5.30 tokens per second)
llama_perf_print:        eval time =    1892.62 ms /   145 runs   (   13.05 ms per token,    76.61 tokens per second)
llama_perf_print:       total time =    2310.02 ms /   147 tokens
ThreadSanitizer: reported 1 warnings

This patch fixes it, though I'm not sure if there is a better solution.
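For reference, a minimal sketch of the type change the PR title describes, assuming the field lives alongside the n_graph counter in the threadpool state; the struct and function names below are illustrative, not the actual ggml.c definitions:

```c
#include <stdatomic.h>

// Illustrative threadpool state; the real struct in ggml.c has more fields.
struct compute_threadpool {
    atomic_int n_graph;        // generation counter used to signal a new graph
    atomic_int n_threads_cur;  // previously a plain int; concurrent reads from
                               // polling threads are no longer a data race
};

// main thread, before starting a new graph compute
static void start_graph_compute(struct compute_threadpool * tp, int n_threads) {
    atomic_store(&tp->n_threads_cur, n_threads);
    atomic_fetch_add(&tp->n_graph, 1); // signal the worker threads
}
```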

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Sep 11, 2024
@slaren
Collaborator

slaren commented Sep 11, 2024

This is a strange case. The access to n_threads_cur should already be synchronized by the n_graph atomic (although the memory_order of that access probably needs to be changed from relaxed). However, there is a data race when there are two computations in a row that use fewer threads than the threadpool contains, especially with only one thread, since in that case the pool's threads are never actually used.
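A sketch of the happens-before chain described here, assuming the main thread publishes the thread count by bumping n_graph with a release store and the workers pick it up with an acquire load; names and structure are illustrative, not the actual ggml code:

```c
#include <stdatomic.h>

struct pool_state {
    atomic_int n_graph;       // generation counter for graph computes
    int        n_threads_cur; // plain int, published through n_graph
};

// main thread: write n_threads_cur first, then release-publish via n_graph,
// so the plain write is visible to any thread that acquires the new value
static void publish_new_graph(struct pool_state * st, int n_threads) {
    st->n_threads_cur = n_threads;
    atomic_fetch_add_explicit(&st->n_graph, 1, memory_order_release);
}

// worker thread: acquire-load n_graph before reading n_threads_cur, which
// orders the plain read after the plain write above
static int wait_for_new_graph(struct pool_state * st, int last_graph) {
    while (atomic_load_explicit(&st->n_graph, memory_order_acquire) == last_graph) {
        // spin (the real code also checks stop/pause flags)
    }
    return st->n_threads_cur;
}
```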

Because the unused threads are not synchronized at the end of the computation, ggml_graph_compute can return before those threads are done checking for work in ggml_graph_compute_ready. In fact, the next compute can start while the threads are still checking for work from the previous graph compute, which results in this data race.

Making n_threads_cur atomic alone is not enough: on some systems the atomic type is just a volatile that does not perform any synchronization between threads, so the atomic_load/atomic_store functions would need to be used for every access to this variable. And since this variable is accessed constantly during the computation, e.g. in ggml_barrier, it would probably be better if it were not an atomic.
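For illustration, keeping the field atomic while making every access explicit would look roughly like this; the helpers are hypothetical, not part of the ggml API. With relaxed ordering they typically compile down to plain loads and stores, so the main cost is having to touch every call site such as ggml_barrier:

```c
#include <stdatomic.h>

// Hypothetical accessors: once n_threads_cur is an atomic_int, every read and
// write site would have to go through something like this.
static inline int threadpool_get_n_threads(atomic_int * n_threads_cur) {
    return atomic_load_explicit(n_threads_cur, memory_order_relaxed);
}

static inline void threadpool_set_n_threads(atomic_int * n_threads_cur, int n) {
    atomic_store_explicit(n_threads_cur, n, memory_order_relaxed);
}
```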

It's also worth noting that this is a very common case: when fully offloading a model to the GPU, the CPU is only used with 1 thread, but the threadpool is still created with the full number of threads. It would be very desirable for the additional threads to sleep on a mutex rather than constantly spin looking for work.
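A hedged sketch of what sleeping instead of spinning could look like for the unused threads, using a pthread condition variable; the struct and function names are illustrative and not taken from ggml.c:

```c
#include <pthread.h>

struct pool_wait {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    int             n_graph; // generation counter, guarded by mutex
};

// worker: block on the condition variable instead of spinning until a new
// graph is posted
static int worker_wait_for_graph(struct pool_wait * w, int last_graph) {
    pthread_mutex_lock(&w->mutex);
    while (w->n_graph == last_graph) {
        pthread_cond_wait(&w->cond, &w->mutex); // releases the mutex while sleeping
    }
    const int cur = w->n_graph;
    pthread_mutex_unlock(&w->mutex);
    return cur;
}

// main thread: post a new graph and wake all workers
static void post_new_graph(struct pool_wait * w) {
    pthread_mutex_lock(&w->mutex);
    w->n_graph++;
    pthread_cond_broadcast(&w->cond);
    pthread_mutex_unlock(&w->mutex);
}
```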

cc @fmz @max-krasnyansky, any suggestions?

@max-krasnyansky
Collaborator

@slaren @ggerganov
I started the new PR #9461, which:

  • fixes the thread sanitizer issue
  • makes n_thread_cur accesses explicit (relaxed memory order is fine in this case, so there is no extra overhead)
  • avoids polling from

@ggerganov
Owner Author

Superseded by #9461

ggerganov closed this on Sep 16, 2024