Bazel 7.4.1 Sporadic Hangs and Server Terminated when tagging tests with resource tags #24506

Open
Ryang20718 opened this issue Nov 27, 2024 · 1 comment

Comments

Ryang20718 commented Nov 27, 2024

Description of the bug:

We recently upgraded from Bazel 6.5.0 to 7.4.1 and have been noticing a lot of flakiness on our CI runners.

Some behaviors we've seen:

  • The Bazel server hangs when running bazel coverage: no output is logged to stdout for a couple of minutes, and then the server is terminated. The time between the last log output and the server termination ranges from 5-10 seconds to upwards of 5-10 minutes. Bazel fails to recover. Peak memory usage is only 70% of the server's total allocated memory.
Server terminated abruptly (error code: 14, error message: 'Connection reset by peer', log file: '/home/github_actions/.cache/bazel/_bazel_github_actions/049fd0d9a142b0eee346c643b8cf35e6/server/jvm.out')
  • Bazel tests time out after hitting the test timeout threshold (peak CPU utilization is ~70% when they time out)

We've also set the following flags to try to alleviate the CPU-related timeouts:

common --experimental_worker_for_repo_fetching=off
common --experimental_sandbox_async_tree_delete_idle_threads=0
test --local_resources=cpu=HOST_CPUS-4

I'm not entirely sure whether these two behaviors are related, but luckily I was able to catch a thread dump on a worker with a hung Bazel test.

It took a while for me to get a stack trace as well:

226872: Unable to access root directory /proc/226872/root of target process 226872
$ sudo jstack 226872;
226872: Unable to open socket file /proc/226872/root/tmp/.java_pid468: target process 226872 doesn't respond within 10500ms or HotSpot VM not loaded
$ sudo jstack 226872;
226872: Unable to open socket file /proc/226872/root/tmp/.java_pid468: target process 226872 doesn't respond within 10500ms or HotSpot VM not loaded

hung.txt
strace_hang.txt

I was able to capture an strace as well as a Java thread dump on an instance where Bazel is hung. We have a monorepo with multiple languages. Specifically, we tag our Python tests with cpu, gpu_memory, and memory resource tags and leave the rest of the tests untagged (since those tests aren't as hefty).

test:ci --local_resources=gpu_memory_mb=15360  --local_resources=memory=HOST_RAM*0.6
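For illustration, a tagged test might look roughly like the sketch below. This is not our actual BUILD file: the target name, source file, and amounts are hypothetical, and it assumes Bazel's `resources:<name>:<amount>` tag format with the resource names matching those passed to `--local_resources`.

```starlark
# Hypothetical BUILD snippet; names and amounts are made up for illustration.
py_test(
    name = "gpu_heavy_test",
    srcs = ["gpu_heavy_test.py"],
    tags = [
        # Assumed tag format: resources:<resource name>:<amount>
        "resources:cpu:4",                # CPU cores reserved for this test
        "resources:memory:8192",          # memory, assumed to be in MB
        "resources:gpu_memory_mb:4096",   # custom resource declared via --local_resources
    ],
)
```

With tags like these, the local scheduler should only start a test when the requested amounts fit within the totals declared by `--local_resources`, so the number of tagged tests running concurrently is bounded by those totals.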

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

I don't have a minimal repro that I can openly share (day job has a monorepo)

Which operating system are you running Bazel on?

Ubuntu 20.04

What is the output of bazel info release?

release 7.4.1

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

Any other information, logs, or outputs that you want to share?

We're using local execution with a remote cache (gRPC).

bazelrc flags:

Inherited 'common' options: --experimental_repository_cache_urls_as_default_canonical_id --watchfs --@io_bazel_rules_docker//transitions:enable=false --ui_actions_shown=32 --experimental_remote_cache_eviction_retries=5 --experimental_remote_cache_lease_extension --noexperimental_inmemory_dotd_files --experimental_worker_for_repo_fetching=off --experimental_sandbox_async_tree_delete_idle_threads=0 --incompatible_default_to_explicit_init_py --incompatible_allow_tags_propagation --experimental_cc_shared_library --heap_dump_on_oom 

  Inherited 'build' options: --output_filter=^// --cxxopt=-std=c++17 --host_cxxopt=-std=c++17 --compilation_mode=opt --host_compilation_mode=opt --interface_shared_objects --use_top_level_targets_for_symlinks=false --java_runtime_version=remotejdk_11 --@rules_rust//rust/settings:experimental_use_cc_common_link=true --@rules_cuda//cuda:runtime=//third_party:cuda_runtime --@rules_cuda//cuda:archs=compute_61:sm_61;compute_70:sm_70;compute_75:sm_75;compute_80:sm_80,compute_80 --@rules_cuda//cuda:copts=--std=c++17 --incompatible_strict_action_env=true --incompatible_enable_cc_toolchain_resolution --sandbox_base=/dev/shm --sandbox_tmpfs_path=/tmp --workspace_status_command=tools/get_workspace_status --action_env CACHE_EPOCH=1673041430 --flag_alias=python_flag=//rules:python_flags --flag_alias=python_monitor_flag=//rules:python_monitor_flag --flag_alias=use_repo_bridge_binary=//waabi/onboard/bin/bridge:enabled --aspects=@rules_rust//rust:defs.bzl%rust_clippy_aspect --experimental_repository_cache_hardlinks --nobuild
@sgowroji sgowroji added coverage team-Configurability platforms, toolchains, cquery, select(), config transitions labels Nov 27, 2024
@Ryang20718 Ryang20718 changed the title from "Bazel 7.3.2 Sporadic Hangs and Server Terminated" to "Bazel 7.4.1 Sporadic Hangs and Server Terminated" Nov 29, 2024
@Ryang20718
Author

Ryang20718 commented Nov 29, 2024

I was able to capture an strace as well as a Java thread dump on an instance where Bazel is hung (hung.txt and strace_hang.txt, attached to the issue description).

@Ryang20718 Ryang20718 changed the title from "Bazel 7.4.1 Sporadic Hangs and Server Terminated" to "Bazel 7.4.1 Sporadic Hangs and Server Terminated when tagging tests with resource tags" Nov 29, 2024
@gregestren gregestren added team-Local-Exec Issues and PRs for the Execution (Local) team and removed team-Configurability platforms, toolchains, cquery, select(), config transitions labels Dec 20, 2024
@oquenchil oquenchil self-assigned this Jan 7, 2025
@oquenchil oquenchil added coverage and removed coverage team-Local-Exec Issues and PRs for the Execution (Local) team labels Jan 7, 2025
@oquenchil oquenchil assigned c-mita and unassigned oquenchil Jan 7, 2025
@c-mita c-mita added coverage and removed coverage labels Jan 7, 2025