CPU utilisation drop when increasing number of threads with threading
#118649
Comments
Referring to @JelleZijlstra's #118153 (comment):
Yes, I am aware of that. My intention is to report the issue and possibly track its resolution.
I ran this on num_threads = 1 -> 0.34 s. And if I disable turbo I get: num_threads = 1 -> 0.68 s
Thank you for checking this. These are interesting results; however, I couldn't reproduce them on my end. I've put together a minimal repro (below). Did I miss some configuration flag for a free-threaded environment?
Setup:
Repro:
import math
import time
import threading

def computational_heavy(iterations):
    val = 0
    sin = math.sin
    cos = math.cos
    for i in range(1, iterations):
        val += sin(i) * cos(i)
    return val

def test(thread_id, iterations=1000000):
    computational_heavy(iterations)

num_threads = [4, 14, 4, 14, 4, 14]
for nt in num_threads:
    threads = [
        threading.Thread(target=test, name=f"Thread{i}", args=(i,))
        for i in range(nt)
    ]
    start = time.perf_counter_ns()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    stop = time.perf_counter_ns()
    print(f"{nt=}. Elapsed time {stop-start} ns")
You are using Python 3.13a4 (from February). Many of the scaling bottlenecks were addressed in the last week or two. You need to use Python 3.13b1 or newer.
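Before timing anything, it can help to confirm which interpreter is actually running and whether the GIL is really off. The sketch below is only an illustration: the helper name describe_build is hypothetical, and it assumes Python 3.13 or newer for sys._is_gil_enabled() (on older versions it falls back to reporting the GIL as enabled).

```python
import sys
import sysconfig

def describe_build():
    """Summarise whether this interpreter can run threads without the GIL."""
    # Py_GIL_DISABLED is 1 on free-threaded builds; 0 or None otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() is assumed to exist only on 3.13+;
    # on older interpreters we report the GIL as enabled.
    gil_check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = gil_check() if gil_check is not None else True
    return {
        "version": sys.version.split()[0],
        "release_level": sys.version_info.releaselevel,
        "free_threaded_build": free_threaded_build,
        "gil_enabled": gil_enabled,
    }

if __name__ == "__main__":
    for key, value in describe_build().items():
        print(f"{key}: {value}")
```

If free_threaded_build is False or gil_enabled is True, the threads in the repro cannot run Python code in parallel, whatever the version string says.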
Thank you for pointing this out. I've changed the environment to the beta release (had to add Dockerfile:
$ docker run -it -v $(pwd):/test -w /test python3.13-nogil python3 --version && python3 test.py
Python 3.13.0b1
nt=4 Elapsed time 468306427 ns
nt=14 Elapsed time 1671326595 ns
nt=4 Elapsed time 480656613 ns
nt=14 Elapsed time 1644104810 ns
nt=4 Elapsed time 470198085 ns
nt=14 Elapsed time 1655974063 ns
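To put the numbers above in perspective, here is a small calculation using only the six timings from this run. Every thread does the same fixed amount of work, so with perfect scaling the nt=14 and nt=4 times would be roughly equal (ratio near 1). The variable names are mine, not part of the repro.

```python
from statistics import mean

# Elapsed times in nanoseconds, copied from the output above.
timings_ns = {
    4: [468306427, 480656613, 470198085],
    14: [1671326595, 1644104810, 1655974063],
}

mean_ns = {nt: mean(samples) for nt, samples in timings_ns.items()}
time_ratio = mean_ns[14] / mean_ns[4]   # how much longer nt=14 takes
thread_ratio = 14 / 4                   # how many more threads were started

print(f"mean nt=4:  {mean_ns[4] / 1e9:.3f} s")
print(f"mean nt=14: {mean_ns[14] / 1e9:.3f} s")
print(f"elapsed-time ratio: {time_ratio:.2f} (thread-count ratio: {thread_ratio:.2f})")
```

The elapsed-time ratio comes out at about 3.5, the same as the thread-count ratio. That is what one would see if total throughput stayed constant as threads were added. The arithmetic alone does not say what causes it.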
Did you try it without the |
Yes, the minimal repro I've provided above contains only Python (not even numpy there). Attaching it once again here, just to keep everything in one place:
import math
import time
import threading

def computational_heavy(iterations):
    val = 0
    sin = math.sin
    cos = math.cos
    for i in range(1, iterations):
        val += sin(i) * cos(i)
    return val

def test(thread_id, iterations=1000000):
    computational_heavy(iterations)

num_threads = [4, 14, 4, 14, 4, 14]
for nt in num_threads:
    threads = [
        threading.Thread(target=test, name=f"Thread{i}", args=(i,))
        for i in range(nt)
    ]
    start = time.perf_counter_ns()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    stop = time.perf_counter_ns()
    print(f"{nt=}. Elapsed time {stop-start} ns")
Bug report
Bug description:
Hi :)
This issue is essentially a re-open of #118153, but with some more experiments added.
I'm testing the free-threaded Python build. I'm running a simple test (code below), which runs a computationally heavy function across CPU cores using the threading module. Time measurements of the script are the following:
In an ideal world, I would expect the three numbers above to be the same (or comparable). I've also gathered profiles of the experiment:
num_threads = 2
num_threads = 8
num_threads = 18 (showing only some threads, but the picture illustrates the issue)
As we can see, the CPU utilisation decreases with the number of CPU threads used (almost 99% for nt=2, about 75% for nt=8 and ~40% for nt=18). We also see increased CPU core switching frequency. My guess is that the decreased CPU utilisation is caused by overhead in the threading module.
Running some further experiments, I ran the program with a high number of threads (so, according to the previous observation, the CPU utilisation should be low), but with both busy and idle wait on 10 threads and the actual sin*cos computation on 2 threads. In both of these scenarios, we observe high CPU utilisation on worker threads:
idle wait (implemented with time.sleep)
busy wait (implemented with while loop)
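The script for this mixed experiment is not included in the report, so the following is only a guess at its shape: 2 threads run the sin*cos loop while 10 others either sleep or spin until the computation finishes. All names here (compute, idle_wait, busy_wait, run_mixed) and the 1 ms sleep interval are assumptions for illustration.

```python
import math
import threading
import time

def compute(iterations):
    # Same kind of sin*cos loop as in the repro.
    val = 0.0
    for i in range(1, iterations):
        val += math.sin(i) * math.cos(i)
    return val

def idle_wait(stop):
    # Idle wait: sleep in short intervals until told to stop.
    while not stop.is_set():
        time.sleep(0.001)

def busy_wait(stop):
    # Busy wait: spin in a while loop until told to stop.
    while not stop.is_set():
        pass

def run_mixed(waiter, n_compute=2, n_wait=10, iterations=1_000_000):
    """Time n_compute compute threads while n_wait threads run `waiter`."""
    stop = threading.Event()
    waiters = [threading.Thread(target=waiter, args=(stop,)) for _ in range(n_wait)]
    workers = [threading.Thread(target=compute, args=(iterations,)) for _ in range(n_compute)]
    start = time.perf_counter_ns()
    for t in waiters + workers:
        t.start()
    for t in workers:
        t.join()
    elapsed = time.perf_counter_ns() - start
    # Release the waiting threads once the compute threads are done.
    stop.set()
    for t in waiters:
        t.join()
    return elapsed

if __name__ == "__main__":
    for waiter in (idle_wait, busy_wait):
        print(f"{waiter.__name__}: {run_mixed(waiter)} ns")
```

On a build with the GIL enabled, the busy-wait variant will be much slower, because the spinning threads compete with the compute threads for the interpreter.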
Interestingly, zooming in on the CPU utilisation profile (the "slow" case), we do see parts of the timeline where the CPU is saturated and all threads are working in parallel. However, there are also periods where CPU utilisation is scattered:
Lastly, as a sanity check, the same operation implemented in C++:
Could this be a bug inside the threading module? I've gone through PEP 703, but I've seen no mention of this. If overhead in threading is the root cause of the lowered utilisation, can this issue be addressed?
@colesbury, tagging you here since I believe you'd know most about the free-threaded Python build. Should this issue be added to the list in #108219?
Testing configuration:
CPython build command:
Testing script:
CPython versions tested on:
CPython main branch
Operating systems tested on:
Linux