Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1.0.80 opencv build hangs #844

Closed
bazhenov opened this issue Aug 3, 2023 · 23 comments
Closed

1.0.80 opencv build hangs #844

bazhenov opened this issue Aug 3, 2023 · 23 comments

Comments

@bazhenov
Copy link

bazhenov commented Aug 3, 2023

Starting version 1.0.80 cc is hanging when building opencv crate. The issue is described here twistedfall/opencv-rust#480, but I will provide key details here.

Bug is known to be reproduced on macOS and Linux. When building opencv cc crate doesn't return control and build script hangs:

$ sample `pgrep build-script` 1
[...]
Call graph:
    821 Thread_4064323   DispatchQueue_1: com.apple.main-thread  (serial)
    + 821 start  (in dyld) + 1903  [0x7ff80328541f]
    +   821 main  (in build-script-build) + 24  [0x10d1f1448]
    +     821 std::rt::lang_start::hb0fc3aad8974accf  (in build-script-build) + 40  [0x10d1dbe08]
    +       821 std::rt::lang_start_internal::h36134e47b0b0d769  (in build-script-build) + 786  [0x10d6924e2]
    +         821 std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::h2c2e2e7b2e7621dd  (in build-script-build) + 16  [0x10d1dbe30]
    +           821 std::sys_common::backtrace::__rust_begin_short_backtrace::h63357bf7d5a68fc9  (in build-script-build) + 9  [0x10d1db859]
    +             821 core::ops::function::FnOnce::call_once::h75d52d17ab0cabce  (in build-script-build) + 10  [0x10d1f1fea]
    +               821 build_script_build::main::h715d2804cb51a4b6  (in build-script-build) + 5648  [0x10d1f1030]
    +                 821 build_script_build::build_wrapper::h9ac1797fad5be4b1  (in build-script-build) + 1622  [0x10d1ef966]
    +                   821 cc::Build::compile::h1e7d8fe3555207f6  (in build-script-build) + 26  [0x10d2029ea]
    +                     821 cc::Build::try_compile::h47399fd47b24182e  (in build-script-build) + 5269  [0x10d201385]
    +                       821 cc::Build::compile_objects::h69182994bcb75fb8  (in build-script-build) + 1631  [0x10d2030ef]
    +                         821 jobserver::Client::acquire::h7c86558239174c7d  (in build-script-build) + 44  [0x10d2c594c]
    +                           821 jobserver::imp::Client::acquire::hbe0bccde86586504  (in build-script-build) + 33  [0x10d2c4871]
    +                             821 jobserver::imp::Client::acquire_allow_interrupts::h3cd4c76be297277e  (in build-script-build) + 166  [0x10d2c4986]
    +                               821 _$LT$$RF$std..fs..File$u20$as$u20$std..io..Read$GT$::read::h7db08d84478a0c20  (in build-script-build) + 41  [0x10d694059]
    +                                 821 read  (in libsystem_kernel.dylib) + 10  [0x7ff8035a0fde]
    821 Thread_4066738
    + 821 thread_start  (in libsystem_pthread.dylib) + 15  [0x7ff8035dabd3]
    +   821 _pthread_start  (in libsystem_pthread.dylib) + 125  [0x7ff8035df1d3]
    +     821 std::sys::unix::thread::Thread::new::thread_start::ha9b8d2206382b425  (in build-script-build) + 41  [0x10d6a3a99]
    +       821 core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::hc59063701881b7f1  (in build-script-build) + 13  [0x10d23fb4d]
    +         821 std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::h5c8d35b9b4a33fb3  (in build-script-build) + 326  [0x10d24cc36]
    +           821 std::panic::catch_unwind::h41e41edc1400bba9  (in build-script-build) + 9  [0x10d2258c9]
    +             821 std::panicking::try::hf47663c6cbc234c6  (in build-script-build) + 40  [0x10d23f398]
    +               821 __rust_try  (in build-script-build) + 29  [0x10d24584d]
    +                 821 std::panicking::try::do_call::h121a1d4d736758b3  (in build-script-build) + 11  [0x10d23f3db]
    +                   821 _$LT$core..panic..unwind_safe..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$::call_once::hdb0ab5f8278a72d1  (in build-script-build) + 13  [0x10d2525cd]
    +                     821 std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h5d3743aaa27d25ee  (in build-script-build) + 9  [0x10d24d109]
    +                       821 std::sys_common::backtrace::__rust_begin_short_backtrace::h2fd0eca9d10ceff5  (in build-script-build) + 9  [0x10d2525f9]
    +                         821 cc::PrintThread::new::_$u7b$$u7b$closure$u7d$$u7d$::h4c1dc51bc9bad5f0  (in build-script-build) + 147  [0x10d21c043]
    +                           821 std::io::BufRead::read_line::h5066fa7bdad03704  (in build-script-build) + 39  [0x10d22f5b7]
    +                             821 std::io::append_to_string::h31a23d9498bfe1a8  (in build-script-build) + 69  [0x10d23e965]
    +                               821 std::io::BufRead::read_line::_$u7b$$u7b$closure$u7d$$u7d$::h3d88681ce2db949c  (in build-script-build) + 36  [0x10d23ebc4]
    +                                 821 std::io::read_until::hd601992808268d5e  (in build-script-build) + 71  [0x10d23e727]
    +                                   821 _$LT$std..io..buffered..bufreader..BufReader$LT$R$GT$$u20$as$u20$std..io..BufRead$GT$::fill_buf::hfce0f277aabd63a0  (in build-script-build) + 27  [0x10d234a8b]
    +                                     821 std::io::buffered::bufreader::buffer::Buffer::fill_buf::hedb19888be2227b5  (in build-script-build) + 189  [0x10d22f78d]
    +                                       821 std::io::impls::_$LT$impl$u20$std..io..Read$u20$for$u20$$RF$mut$u20$R$GT$::read_buf::h2b1babb7e81e9e02  (in build-script-build) + 12  [0x10d2254cc]
    +                                         821 _$LT$std..fs..File$u20$as$u20$std..io..Read$GT$::read_buf::hcbb0cdcd42fe7964  (in build-script-build) + 54  [0x10d693eb6]
    +                                           821 read  (in libsystem_kernel.dylib) + 10  [0x7ff8035a0fde]

The issue is happens on 1.0.80 and 1.0.81. Although there are some evidences that 1.0.80 works (see twistedfall/opencv-rust#480 (comment))

@NobodyXu
Copy link
Collaborator

NobodyXu commented Aug 3, 2023

Ok, I realize there could be one bug in cc::Build::compile_objects that caused this:

Wait-thread could exit early due to error, however the spawn-thread will go on and keep spawning, thus eventually hitting a deadlock.

Would open a PR shortly.

Originally posted by @NobodyXu in twistedfall/opencv-rust#480 (comment)

@bazhenov
Copy link
Author

bazhenov commented Aug 3, 2023

I'm so sorry, this is what is called race condition :)

@NobodyXu
Copy link
Collaborator

NobodyXu commented Aug 3, 2023

Fix for this is to keep wait thread running until the channel reaches eof.

@twistedfall
Copy link

I've bisected the git tree between 1.0.79 and 1.0.80 the offending commit is ff45d42 (unsurprisingly), not sure how much help that is :)

@NobodyXu
Copy link
Collaborator

NobodyXu commented Aug 3, 2023

@twistedfall Does v1.0.80 still work fine for you as previously stated?

If so then it is really strange, since v1.0.80 includes this commit.

@twistedfall
Copy link

twistedfall commented Aug 3, 2023

No, it doesn't, I'm not sure why it worked in my previous testing (I have updated that comment)

@NobodyXu
Copy link
Collaborator

NobodyXu commented Aug 3, 2023

No, it doesn't, I'm not sure why it worked in my previous testing (I have updated that comment)

That's strange, if v1.0.80 somehow works and then stop working suddenly, then I think it could be something else is affecting the build.

@NobodyXu
Copy link
Collaborator

NobodyXu commented Aug 3, 2023

Is there any other dependencies that also use the jobserver provided by cargo directly (like invoking make, cmake, ninja) or use cc in build.rs?

Could be that their build.rs does something strange, exhaust all the token in the jobserver causing opencv's build script to stuck forever.

@twistedfall
Copy link

Well the build script itself uses the job server: https://github.com/twistedfall/opencv-rust/blob/master/build/generator.rs#L82, but it releases the acquired tokens, so it should be fine. But I'm reading the docs for the jobserver and it looks like it might not be possible to call Client::from_env twice which what's currently happening.

@NobodyXu
Copy link
Collaborator

NobodyXu commented Aug 3, 2023

@twistedfall In your build/generator.rs you create a jobserver::Client using Client::from_env and dropped it, this probably has something to do with the error.

jobserver::Client::from_env is unsafe for a reason: It takes ownership of the jobserver specified in environment.

When you drop a jobserver::Client, you also closed the underlying fds.
If the jobserver is passed via fds (pipe), then it will be closed.

But the fds specified in the environment variable isn't changed, so you could be just referring to random fds and if it is a valid fd, then jobserver::Client::from_env would accept that and now we could have random behavior.

If the jobserver is passed via fifo (path to a named pipe), then it will be fine.

Edit:

This also explains why this can only be reproduced on Linux and MacOS, but not on Windows, since jobserver on Windows uses named fifo.

@NobodyXu
Copy link
Collaborator

NobodyXu commented Aug 3, 2023

@twistedfall I have a fork jobslot which has its jobslot::Client::from_env clones the fds instead of taking ownership, so it might be able to work in your situation.

@NobodyXu
Copy link
Collaborator

NobodyXu commented Aug 3, 2023

@twistedfall Another solution is to wrap the jobserver::Client with std::mem::ManuallyDrop and never drops it.

It should be fine given that it's only used in build-script and it will get cleaned up by OS on exits.

@twistedfall
Copy link

That would work I think, but it looks like it's only time before another crate uses jobserver and cc with parallel feature at the same time and encounters the same problem. Maybe it's possible for cc to accept externally created jobserver in something like try_compile_with_jobserver()? That would be the only clean solution that I see that would allow both build script and cc to continue using jobserver.

@NobodyXu
Copy link
Collaborator

NobodyXu commented Aug 3, 2023

That would work I think, but it looks like it's only time before another crate uses jobserver and cc with parallel feature at the same time and encounters the same problem.

I could open a PR in jobserver to clone the fds instead of taking their ownership.

@NobodyXu
Copy link
Collaborator

NobodyXu commented Aug 3, 2023

Maybe it's possible for cc to accept externally created jobserver in something like try_compile_with_jobserver()? That would be the only clean solution that I see that would allow both build script and cc to continue using jobserver.

This will clearly be behind a feature flag and add complexity to the existing API and implementation since it would require duplication since now compile_objects with or without feature flag parallel has different APIs internally now, not sure whether this is a good idea.

@NobodyXu
Copy link
Collaborator

NobodyXu commented Aug 3, 2023

@twistedfall I've opened a PR in jobserver-rs for this use case.

@bazhenov
Copy link
Author

bazhenov commented Aug 3, 2023

That's strange, if v1.0.80 somehow works and then stop working suddenly, then I think it could be something else is affecting the build.

@NobodyXu As you previously mentioned the compile_objects() is waiting for results from jobserver which is in parallel thread. The issue might reproduce non deterministically under some particular conditions. Literally race.

@NobodyXu
Copy link
Collaborator

NobodyXu commented Aug 3, 2023

@NobodyXu As you previously mentioned the compile_objects() is waiting for results from jobserver which is in parallel thread.

Please check my comment above, it's likely this is caused by use of unsafe function jobserver::Client::from_env in opencv build.rs .

The issue might reproduce non deterministically under some particular conditions. Literally race.

It is nit "race" in traditional sense since it is perfectly memory safe, just that the blocking nature of jobserver::Client::acquire making it a bit hard to write correct code without use of async.

Beaides, it's verified that #846 does not fix this, so it's something else.

@twistedfall
Copy link

@twistedfall I've opened a PR in jobserver-rs for this use case.

Thanks a lot for that, this should be the best solution!

@twistedfall
Copy link

@NobodyXu I have tried switching to jobslot and I can confirm that it works together with cc version 1.0.81, but there is a change in parallel behavior, it looks like the new version is not utilizing the scheduler fully. The cpp compilation phase takes around 16seconds with the cc 1.0.79, but around 20 with cc 1.0.81. I have some quick CPU load graphs, there are 16 CPU threads available.

  • v1.0.79, you can clearly see two batches: binding generation and cpp compilation:
    CPU-1 0 79
  • v1.0.81, the first batch for binding generation is clearly visible, but the cpp compilation is no longer batched together and doesn't load the CPU fully which explains the increase in build time:
    CPU-1 0 81

@NobodyXu
Copy link
Collaborator

NobodyXu commented Aug 4, 2023

@twistedfall Thanks for report!
This is worth investigating, since v1.0.81 should actually improve performance by reducing amount of thread created.

It seems that the cc v1.0.81 wasn't able to spawn compilation task fast enough, it could be blocked waiting for jobserver token to be released by the wait thread.

@NobodyXu
Copy link
Collaborator

NobodyXu commented Aug 4, 2023

@twistedfall I've opened #849 which should fix (or at least reduce) the performance regression.

@NobodyXu
Copy link
Collaborator

Closing this as it should have been fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants