Bazel 7.4.0 build hangs indefinitely sometimes #24520

Open
TeoMihuc opened this issue Nov 28, 2024 · 6 comments
Labels
team-Performance Issues for Performance teams type: bug untriaged

Comments

@TeoMihuc

TeoMihuc commented Nov 28, 2024

Description of the bug:

We are using Bazel 7.4.0 and have noticed random build hangs, both in our CI and locally.

Setup:

  • We run bazel test inside a docker build RUN step. The docker build hangs indefinitely, always at this step: 5 / 5 tests; no actions running. The docker build runs on a WSL Linux amd device and we cross-compile for arm using: docker run --privileged --rm tonistiigi/binfmt:latest --install all
  • We use these flags (among others), together with a remote cache: --spawn_strategy=processwrapper-sandbox --strategy=Javac=processwrapper-sandbox --noremote_accept_cached
  • Varying the flags did not change the outcome (the hang also occurs with a local cache).

What is your hunch on this one? Is there something else I can try?

Thanks a lot!

Which category does this issue belong to?

No response

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

No response

Which operating system are you running Bazel on?

(Ubuntu 22.04.5 LTS) linux cross-compilation to arm docker image

What is the output of bazel info release?

release 7.4.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

From https://github.com/bazelbuild/bazel/releases, using: bazel-7.4.0-linux-arm64

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.

No response

Have you found anything relevant by searching the web?

Similar issues:

Any other information, logs, or outputs that you want to share?

28.11.2025_debug_bazel_build.txt

This is the output I get when I press Ctrl+\ while Bazel hangs.

The hang happens in CI as well, but I can't get any debug output from there.

@tjgq
Contributor

tjgq commented Nov 28, 2024

The attached stack trace is either from Bazelisk or some other wrapper around the Bazel client. You must send the SIGQUIT to the Bazel server process instead (the result should look like a Java thread dump, not Go).

To find the Bazel server pid you can look in $OUTPUT_BASE/server/server.pid.txt. The thread dump will be written to $OUTPUT_BASE/server/jvm.out.
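
The steps above can be sketched as a small shell helper. This is only a sketch: the function name dump_bazel_threads is made up for illustration, the output base path must be supplied by the caller, and the one-second pause is an arbitrary guess at how long the JVM needs to write the dump.

```shell
# Hypothetical helper: send SIGQUIT to the Bazel server that owns the
# given output base and print the path of the file holding the thread dump.
# Usage: dump_bazel_threads /path/to/output_base
dump_bazel_threads() {
    output_base="$1"
    pid_file="$output_base/server/server.pid.txt"

    if [ ! -r "$pid_file" ]; then
        echo "cannot read $pid_file" >&2
        return 1
    fi

    # The pid file holds the pid of the Bazel server (the JVM), not the client.
    server_pid="$(cat "$pid_file")"

    # SIGQUIT asks the JVM to print a thread dump; it does not stop the server.
    kill -QUIT "$server_pid" || return 1

    # Assumption: a short pause is enough for the dump to be written.
    sleep 1

    echo "$output_base/server/jvm.out"
}
```

When the build runs inside a container, the helper has to be run where the server process and its output base are visible, for example from a shell inside that container.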

@TeoMihuc
Author

TeoMihuc commented Nov 28, 2024

Sorry, attached.
jvm.txt

@Ryang20718

Ryang20718 commented Nov 29, 2024

We're on Bazel 7.4.1 and we're also hitting this (albeit with linux-sandbox rather than processwrapper-sandbox). Here are a thread dump and an strace of linux-sandbox, both taken while the process was hung. We hit this when tagging a subset of tests (Python) in CI with resource tags while leaving the others (such as Go) untagged,

i.e.

tags = ["cpu:3"]

It took me a while to get a stack trace as well:

226872: Unable to access root directory /proc/226872/root of target process 226872
$ sudo jstack 226872;
226872: Unable to open socket file /proc/226872/root/tmp/.java_pid468: target process 226872 doesn't respond within 10500ms or HotSpot VM not loaded
$ sudo jstack 226872;
226872: Unable to open socket file /proc/226872/root/tmp/.java_pid468: target process 226872 doesn't respond within 10500ms or HotSpot VM not loaded

hung.txt
strace_hang.txt

@tjgq
Contributor

tjgq commented Dec 3, 2024

@TeoMihuc Unfortunately the thread dump doesn't suggest a probable cause. Some followup questions to try to narrow this down:

  • Are you able to reproduce it outside of Docker?
  • Are you able to reproduce it with sandboxing disabled (i.e., using standalone instead of processwrapper-sandbox)?
  • Can you please provide the complete list of flags you're using (feel free to redact sensitive information) in case there's something else in there that might provide a clue?
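
As an illustration of the second question, a run with sandboxing disabled could look roughly like the sketch below. The function name test_without_sandbox and the BAZEL override variable are hypothetical and exist only so the wrapper can be dry-run; the two strategy flags mirror the ones from the original report with standalone substituted for processwrapper-sandbox.

```shell
# Hypothetical wrapper: run the given test targets without sandboxing,
# to check whether the hang depends on the sandbox strategy.
# BAZEL is an assumed override; set BAZEL=echo to print instead of run.
test_without_sandbox() {
    "${BAZEL:-bazel}" test \
        --spawn_strategy=standalone \
        --strategy=Javac=standalone \
        "$@"
}
```

Calling test_without_sandbox with the same targets as the hanging build, plus any other flags the build normally uses, keeps the comparison as close as possible to the original setup.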

@tjgq
Contributor

tjgq commented Dec 3, 2024

@Ryang20718 Your issue looks distinctly different: there are two threads seemingly stuck somewhere in JVM code (Unsafe.getLong and Class.getModule). Can you confirm that the thread dump was obtained after Bazel hung, not at a random point while it was still doing work? Does the thread dump consistently show two threads hung at these two points?

@Ryang20718

@tjgq I can try to grab another thread dump during a hang to confirm, but I captured this dump after confirming that the test processes were hung (no output from strace).


6 participants