We recently upgraded from Bazel 6.5.0 to 7.4.1, and we've been noticing a lot of flakiness on our CI runners.
Some of the behaviors we've seen:
Bazel server hangs when running bazel coverage: no output is logged to stdout for a couple of minutes, and then the server is terminated. The time between the last log output and the server terminating ranges from 5-10 seconds to upwards of 5-10 minutes. Bazel fails to recover (peak memory usage is only 70% of the server's total allocated memory).
Server terminated abruptly (error code: 14, error message: 'Connection reset by peer', log file: '/home/github_actions/.cache/bazel/_bazel_github_actions/049fd0d9a142b0eee346c643b8cf35e6/server/jvm.out')
Bazel tests time out after hitting the test timeout threshold (peak CPU utilization is ~70% when timing out)
We've also set the following flags to try to alleviate the CPU timeouts:
common --experimental_worker_for_repo_fetching=off
common --experimental_sandbox_async_tree_delete_idle_threads=0
test --local_resources=cpu=HOST_CPUS-4
I'm not entirely sure whether these two are related, but luckily I was able to catch a thread dump on a worker with a hung bazel test.
It took a while for me to get a stack as well:
226872: Unable to access root directory /proc/226872/root of target process 226872
$ sudo jstack 226872;
226872: Unable to open socket file /proc/226872/root/tmp/.java_pid468: target process 226872 doesn't respond within 10500ms or HotSpot VM not loaded
$ sudo jstack 226872;
226872: Unable to open socket file /proc/226872/root/tmp/.java_pid468: target process 226872 doesn't respond within 10500ms or HotSpot VM not loaded
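For anyone trying to grab a dump from a similarly unresponsive server, here is a minimal retry sketch. `dump_with_retry` is a hypothetical helper, not part of Bazel or the JDK, and the `github_actions` user in the usage comment is an assumption taken from the path in the error message above.

```shell
#!/bin/sh
# Sketch only: rerun a command until it succeeds or the attempts run out.
# dump_with_retry is a hypothetical helper name, not a Bazel or JDK tool.
# Usage: dump_with_retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
dump_with_retry() (
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Run the wrapped command; stop at the first success.
    if "$@"; then
      exit 0
    fi
    echo "attempt $i/$attempts failed: $*" >&2
    # Only wait if another attempt is coming.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  exit 1
)

# Example (commented out; assumes the server runs as github_actions):
#   dump_with_retry 5 10 sudo -u github_actions jstack 226872 > hung.txt
```

Running jstack as the user that owns the JVM, rather than as root, may avoid the "Unable to access root directory" message. If jstack still cannot attach, `kill -3 <pid>` (SIGQUIT) asks a HotSpot JVM to print a thread dump to its own stdout, which for the Bazel server should end up in the jvm.out file named in the error above. I have not verified that either approach works while the server is in this hung state.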
I was able to capture an strace as well as a Java thread dump on an instance where it's hung. We have a monorepo with multiple languages. Specifically, we tag our Python tests with cpu, gpu_memory, and memory, and leave the rest of the tests untagged (since those tests aren't as hefty).
Ryang20718 changed the title from "Bazel 7.4.1 Sporadic Hangs and Server Terminated" to "Bazel 7.4.1 Sporadic Hangs and Server Terminated when tagging tests with resource tags" on Nov 29, 2024
Description of the bug:
hung.txt
strace_hang.txt
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
I don't have a minimal repro that I can openly share (day job has a monorepo)
Which operating system are you running Bazel on?
Ubuntu 20.04
What is the output of bazel info release?
release 7.4.1
If bazel info release returns development version or (@non-git), tell us how you built Bazel.
Any other information, logs, or outputs that you want to share?
We're using local execution with a remote cache (grpc)
bazel rc flags