Triaging/improving the number of crates classified as build-fail #589
Comments
The OOM problem seems to disproportionately afflict the most-downloaded crates: 202 of the top 1,000 crates are currently routinely failing crater runs. Of those 202, at least 76 are OOMs.
A few years later, a new datapoint, now on the Rust 1.81 release cycle: I was just looking through the crater run for PR 116088 to see why it didn't report a failure on one of my crates, and it's because ... it wouldn't compile at all. Because it was SIGKILL'ed trying, on both [...]. In the build-fail directory, I ran this:
This extracted every invocation of rustc from the [...]. That produced a file with 18,264 lines, so 18k of the 121k build-fails are SIGKILLs. And they're fairly concentrated: only 2,121 unique crates were being built when the signal arrived, and of those only 21 crates appeared more than 100 times:
A lot of these are fairly memory-intensive to build. I think it's likely that they are in fact all OOMs. In which case you're losing a fair amount of signal to a memory limit. How bad / implausible would it be to raise the limit?
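A tally of this kind can be sketched roughly as follows. This is a minimal sketch, not the command that was actually run: it assumes a killed rustc shows up in a log as one line containing both the `rustc --crate-name <name> ...` invocation and the text `SIGKILL`, as in cargo's "process didn't exit successfully" message. The function name, the crate names, and the sample lines are all hypothetical.

```python
import re
from collections import Counter

# Assumption: a killed rustc appears as a single log line containing both the
# rustc invocation (with `--crate-name <name>`) and the text "SIGKILL".
CRATE_NAME = re.compile(r"rustc\s.*?--crate-name\s+(\S+)")


def count_sigkilled_crates(lines):
    """Count, per crate name, the rustc invocations that ended in SIGKILL."""
    counts = Counter()
    for line in lines:
        if "SIGKILL" not in line:
            continue
        match = CRATE_NAME.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines with made-up crate names.
sample = [
    "process didn't exit successfully: `rustc --crate-name crate_a --edition=2018 src/lib.rs` (signal: 9, SIGKILL: kill)",
    "process didn't exit successfully: `rustc --crate-name crate_a --edition=2021 src/lib.rs` (signal: 9, SIGKILL: kill)",
    "process didn't exit successfully: `rustc --crate-name crate_b src/lib.rs` (exit status: 1)",
    "process didn't exit successfully: `rustc --crate-name crate_c src/lib.rs` (signal: 9, SIGKILL: kill)",
]

counts = count_sigkilled_crates(sample)
for name, n in counts.most_common():
    print(name, n)
```

Sorting by count makes the concentration visible: a handful of crates that are killed many times as dependencies account for a large share of the failures.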
I'm writing this up because on the Zulip, simulacrum suggested nobody had done this before. If this is already known I hope I don't seem patronizing by writing it up.
There were a number of regressions related to a recent LLVM version bump which resulted in rampant resource utilization for a number of crates. To a non-expert this is confusing: shouldn't crater have detected that? Unfortunately, it looks like crater has a serious problem with OOMs generally: #564, #562, #544, #516, #490, #484. I'm hoping that by looking into a lot of the available logs I can make some suggestions, maybe put in a PR (though I've never worked in this codebase) and generally improve the quality of crater runs.
I did some very basic poking around all of 1.54, 1.55, 1.56, 1.57, and 1.58 runs, only considering published crates. It looks like the sort of failures we see is basically the same from version to version. So I focused on just 1.58 because I'm much more concerned about systematic behavior among the build failures and what can be done about it.
The 1.58 run has 14,921 `build-fail/reg` crates. Of those...

- `#!{experimental]`
- `include_bytes!`/`include_str!` some file that's not in the repo

There are also a lot that I didn't categorize, such as attempts to compile with macOS frameworks, use of `llvm_asm!`, missing `eh_personality`, and a lot of crates that require the user to turn on a non-default feature to build.

My biggest concern with the current setup is that the number of CPUs that a build is spawned with is sporadic, and this alone causes a significant number of OOMs. The most hilarious case of this that I've found is `memx`. The author has quite diligently written 35 integration test binaries, which means on a 64-core machine each integration test has only 44 MB to work with. That's enough for rustc actually, but not for all the `ld` processes. `regex` fails most crater runs for the same reason, but its codebase is much more memory-intensive. With only 4 CPUs, `regex` will OOM building tests.

This is why the vast, vast majority of spurious-fixed and spurious-regressed crates are regressed to or fixed from an OOM. They OOM when they're randomly assigned to an environment that happens to have too many CPUs, then most likely are assigned to an environment next time that has far fewer.
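The `memx` figure follows from dividing the memory limit by the number of processes running at once. A rough sketch of that arithmetic, assuming the 1.5 GB limit means 1536 MB, that memory is split evenly, and that the build runs one process per CPU up to the number of binaries ready to build (the function name is made up for illustration):

```python
def memory_per_process_mb(limit_mb, cpus, runnable_units):
    """Memory each concurrent process gets if the limit is split evenly.

    Assumes one process per CPU, capped by how many units
    (for example test binaries) are ready to build at once.
    """
    concurrent = max(1, min(cpus, runnable_units))
    return limit_mb / concurrent


LIMIT_MB = 1.5 * 1024  # assumption: the 1.5 GB limit read as 1536 MB

many_cpus = memory_per_process_mb(LIMIT_MB, cpus=64, runnable_units=35)
few_cpus = memory_per_process_mb(LIMIT_MB, cpus=4, runnable_units=35)

print(f"64 CPUs: {many_cpus:.0f} MB per process")
print(f" 4 CPUs: {few_cpus:.0f} MB per process")
```

The same crate gets roughly nine times more memory per process on a 4-CPU environment than on a 64-CPU one, which is consistent with results flipping between runs.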
The build timeouts (`no output for 300 seconds`) are also interesting. Since there are only 162 of them, I tried to reproduce all of them myself. Most of them are not reproducible. But I did find a few true positives lurking in there:

- `savage` and `sdc-parser` push the 1.5 GB limit even with a single job. They probably look like a timeout but only on account of the memory limit.
- `fungui` could possibly be considered a compiler hang on Rust 1.57, but not on the current nightly. It's not clear to me if crater could have spotted a compile time regression that it otherwise missed if this were noticed.
- `ilvm` needs 30 minutes to compile on Rust 1.57, the 1.58 beta, and current nightly. I think it qualifies as a compiler hang: the codebase is pretty small and simple for a compilation that long.

If we only saw 4 build timeouts instead of 162, perhaps they could have been manually inspected on every crater run. So perhaps there's an opportunity here?
Some ideas:
The root problem with all the spurious OOMs is that the peak memory usage of a build scales with the number of CPUs available, but crater doesn't scale the available memory up even as it scales the number of available CPUs randomly by a factor of 10 or more. Setting a job limit on cargo would only be a partial solution because there are plenty of build scripts that compile C libraries that fan out parallelism to the number of CPUs detected. I think it would be a huge improvement to limit the number of CPUs or provide a memory limit that scales up with the number of CPUs.
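The two options in that paragraph can be sketched as a pair of small calculations. This is only an illustration under stated assumptions: the per-job budget is a hypothetical number rather than a measurement, and both function names are invented here. A job cap computed this way could be handed to cargo through its `--jobs` option, though as noted above that would not constrain build scripts that detect the CPU count themselves.

```python
def jobs_for_memory(limit_mb, cpus, per_job_mb):
    """Largest job count that fits the memory limit, never above the CPU count."""
    fits = int(limit_mb // per_job_mb)
    return max(1, min(cpus, fits))


def memory_for_cpus(cpus, per_job_mb):
    """Memory limit that grows linearly with the number of CPUs."""
    return cpus * per_job_mb


PER_JOB_MB = 384  # hypothetical budget per build process, not a measured value

for cpus in (4, 64):
    capped_jobs = jobs_for_memory(1536, cpus, PER_JOB_MB)
    scaled_limit = memory_for_cpus(cpus, PER_JOB_MB)
    print(f"{cpus} CPUs: cap at {capped_jobs} jobs, or raise limit to {scaled_limit} MB")
```

Either approach keeps the memory available per process constant across environments, which is the property the current setup lacks.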
The build timeouts, as well as the results that crater summaries already categorize as `error`, are also quite interesting. Quite a few just look like this:
This crate with the same version was test-pass in 1.56, test-fail in 1.57, and error in 1.58. This sort of output smells like a transient network error. Is there a retry mechanism for crate builds? And even if there isn't, it would be good to get a lot more logging related to downloads so that we could have more hope of diagnosing these. The `error` crates aren't `build-fail` (which is what the title says) but this seems like the same pathology as the timeout crates suffer from; almost like a sudden loss of networking.
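For illustration, a retry mechanism with per-attempt logging could look something like the sketch below. Nothing here reflects how crater works today: `TransientError`, `run_with_retries`, and `flaky_download` are hypothetical names, and deciding which failures count as transient is left open.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying, such as a dropped connection."""


def run_with_retries(task, attempts=3, base_delay=1.0, sleep=time.sleep, log=print):
    """Run task(), retrying on TransientError with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except TransientError as err:
            # Log every failed attempt so transient problems can be diagnosed later.
            log(f"attempt {attempt}/{attempts} failed: {err}")
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Usage with a stand-in task that fails twice before succeeding.
calls = {"n": 0}


def flaky_download():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("connection reset")
    return "ok"


delays = []  # record the backoff delays instead of actually sleeping
result = run_with_retries(flaky_download, sleep=delays.append, log=lambda msg: None)
print(result, delays)
```

Only failures explicitly marked as transient are retried, so a genuine build failure would still surface on the first attempt.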