
Fixup deadlock when hammering futures of a certain type #88

Merged
merged 3 commits into main on Sep 17, 2024

Conversation

jhugman (Owner) commented Sep 5, 2024

According to The Big O of Code Reviews, this is an O(n) change.

This PR moves the polling of futures outside of the continuation callback; polling inside the callback has been shown to be a source of deadlocks.

      // From https://github.com/mozilla/uniffi-rs/pull/1837/files#diff-8a28c9cf1245b4f714d406ea4044d68e1000099928eaca1afb504ccbc008fe9fR35-R37
      //
      // > WARNING: the call to [rust_future_poll] must be scheduled to happen soon after the callback is
      // > called, but not inside the callback itself.  If [rust_future_poll] is called inside the
      // > callback, some futures will deadlock and our scheduler code might as well.
      //
      // This delay is to ensure that `uniffiFutureContinuationCallback` returns before the next poll, i.e.
      // so that the next poll is outside of this callback.
      //
      // The length of the delay seems to be significant (at least in tests which hammer a network).
      // I would like to understand this more: I am still seeing deadlocks when this drops below its current
      // delay, but these may be related to a different issue, as alluded to in
      // https://github.com/mozilla/uniffi-rs/pull/1901

Unfortunately, the tests for this aren't likely to end up in this repo, since they rely on a fork of the matrix-sdk.

As noted in the comment, I don't think this has fixed all possible deadlocks, but I'm hoping it has at least fixed the proximal problem.
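
To make the shape of the change concrete, here is a minimal TypeScript sketch, assuming illustrative names and constants (`continuationCallback`, `pollAgain`, `MAYBE_READY` are stand-ins, not the generated bindings API): the continuation callback no longer polls the future itself; it only schedules the next poll and returns.

```ts
// Illustrative sketch only: names, signature and constants are assumptions,
// not the real generated bindings.
const MAYBE_READY = 0;     // assumed poll-result code, for illustration
const POLL_DELAY_MS = 100; // the delay discussed in the comment above

function continuationCallback(pollResult: number, pollAgain: () => void): void {
  if (pollResult === MAYBE_READY) {
    // Schedule the next poll so it runs *after* this callback has returned,
    // rather than calling pollAgain() here directly.
    setTimeout(pollAgain, POLL_DELAY_MS);
  }
  // Returning promptly lets the Rust side finish its poll before the next one
  // starts.
}
```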

Fixes #89

@jhugman jhugman requested a review from zzorba September 5, 2024 16:18
@jhugman jhugman force-pushed the jhugman/fixup-async-deadlock branch 4 times, most recently from f657971 to 67eb94d on September 5, 2024 17:10
zzorba (Collaborator) commented Sep 5, 2024

Thank you for the fix, it does work on our branch and the deadlock is gone now.

Though the fact that 100ms is the threshold at which it starts working makes me worry a little that we've just worked around the problem?

Does the extra delay need to be applied to every poll, or is it the first one that matters most? (I thought the uniffi issue was saying that it was related to setup + poll being too close to each other.)

zzorba (Collaborator) left a comment

Approve as a viable workaround, though I'd love to better understand the 'why' if possible.

jhugman (Owner, Author) commented Sep 6, 2024

My understanding is that the poll is protected by a mutex.

A call to the poll on the Rust side:

  • takes the mutex lock. This is not a re-entrant mutex.
  • does some work
  • calls into the JS uniffiFutureContinuationCallback with an update on the status of the task.
  • then releases the mutex lock.

Before this change, the uniffiFutureContinuationCallback would:

  • resolve the promise that the poll loop is awaiting on.
  • the polling loop then either ends and the completeFunc is called, OR:
  • it calls back into the Rust poll, which then tries to acquire the lock that is still held.
  • because that lock is held, the poll blocks, so uniffiFutureContinuationCallback never returns to release it: a deadlock.

The fix makes the work done in uniffiFutureContinuationCallback just a setTimeout to schedule the next poll, instead of the poll itself. The call to setTimeout returns immediately, so uniffiFutureContinuationCallback can return and unlock the mutex before the next poll.
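
Here is a small conceptual model of that sequence in TypeScript. This is not the real uniffi code: a real Rust mutex would block rather than throw, and the names are made up; the throw is only there to make the re-entrant acquisition visible.

```ts
// A toy model of the deadlock, NOT the real uniffi internals.
class NonReentrantLock {
  private held = false;
  acquire(): void {
    if (this.held) throw new Error("deadlock: lock is already held");
    this.held = true;
  }
  release(): void {
    this.held = false;
  }
}

const schedulerLock = new NonReentrantLock();

// Stand-in for the Rust-side poll: it holds the lock while it invokes the JS
// continuation callback, and only releases it after the callback returns.
function rustSidePoll(continuation: () => void): void {
  schedulerLock.acquire();
  continuation();
  schedulerLock.release();
}

// Before the fix: the callback polls again while the lock is still held.
// rustSidePoll(() => rustSidePoll(() => {})); // throws (i.e. would deadlock)

// After the fix: the callback only *schedules* the next poll, so it returns
// (and the lock is released) before the next poll runs.
rustSidePoll(() => setTimeout(() => rustSidePoll(() => {}), 0));
```

The last line is the shape of the fix: because the next poll is only scheduled, the callback returns and the lock is released before the poll actually runs.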

This analysis suggests that the deadlock could be reliably reproduced by setting up a situation where the pollResult is MAYBE_READY.

What I can't explain, if the above is true, but can hazard a guess at:

  • what is the significance of the 100ms? Why can't this be 0, i.e. just use setImmediate?

At this point, it's worth pointing out:

  • the test-harness implements a rudimentary event loop. There are setTimeout and friends, which I lifted from @tmikov's excellent hermes-jsi-demos. On top of that, I have implemented a CallInvoker which stands in for the react::CallInvoker (a rough sketch of the idea follows this list). I have not verified that the behaviour of the event loop/microtasks etc. is standards-compliant, or matches how react-native does it.
  • the build process is emitting very old-style JS. It's still using Metro, but there is a tsc step before it. This would suggest that the JS being executed in test may be quite different from the JS running on device. Is this a problem? I don't think so, but it is a source of code difference, so perhaps.
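
For what it's worth, here is a rough TypeScript sketch of the kind of queue the harness's CallInvoker boils down to. The real one is C++ built on top of hermes-jsi-demos; the names and behaviour here are assumptions for illustration only.

```ts
// A rough model of a task queue standing in for react::CallInvoker.
type Task = () => void;

class FakeCallInvoker {
  private queue: Task[] = [];

  // Enqueue a task to run on a later turn of the loop (cf. invokeAsync).
  invokeAsync(task: Task): void {
    this.queue.push(task);
  }

  // Stand-in for the event loop pumping queued tasks in FIFO order.
  drain(): void {
    while (this.queue.length > 0) {
      this.queue.shift()!();
    }
  }
}

// Usage: schedule two tasks, then pump the queue.
const invoker = new FakeCallInvoker();
invoker.invokeAsync(() => console.log("first"));
invoker.invokeAsync(() => console.log("second"));
invoker.drain();
```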

@zzorba Would you be able to change the 100 to 1 or zero, and try it out within React Native?

What still confuses me, if the above is true:

  • why don't we see this every time a MAYBE_READY is returned?
  • why do we still see the occasional deadlock during test? Will we see the occasional deadlock on device? Are there other deadlocks lurking in this code?

Eyeballing uniffi's future.rs suggests there are two Mutexes per future. The principal one we've been discussing is the scheduler one, but perhaps there is some ordering of locking and unlocking that the JS code is getting into. More research needed.

@jhugman jhugman force-pushed the jhugman/fixup-async-deadlock branch 4 times, most recently from 422bd95 to d0b5904 on September 16, 2024 18:53
@jhugman jhugman force-pushed the jhugman/fixup-async-deadlock branch from d0b5904 to 86a80c1 on September 17, 2024 16:29
jhugman (Owner, Author) commented Sep 17, 2024

The continuation callback needed to be called and done by the time the next poll happened.

Tests for this are in #89. The test loads the same image 2, 4, 8, 16, 32, 64, 128, 256, 512, 512 times.
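
The shape of that test, sketched in TypeScript with a hypothetical `loadImage` function (the real test lives in the matrix-sdk fork, see #89):

```ts
// Hedged sketch of the hammer test described above. `loadImage` is a
// hypothetical stand-in for the call exercised by the real test; the doubling
// counts follow the sequence in the description.
async function hammer(loadImage: () => Promise<unknown>): Promise<void> {
  for (const count of [2, 4, 8, 16, 32, 64, 128, 256, 512]) {
    const started = Date.now();
    // Fire `count` concurrent loads of the same image and wait for them all.
    await Promise.all(Array.from({ length: count }, () => loadImage()));
    console.log(`loaded ${count} images in ${Date.now() - started} ms`);
  }
}
```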

Things I tried:

  • I rewrote UniffiCallInvoker::invokeBlocking to use lighter-weight primitives. This increased the completion rate, but I was still seeing deadlocks in about 3 out of 7 runs.
  • I tried scheduling the next poll onto the next tick of the event loop. Again, this increased the success rate, to about 5 out of 10 runs.
  • I added a UniffiCallInvoker::invokeNonBlocking which just adds a callback call to the event queue (see the sketch after this list). This increased the success rate to 10/10. I added timings at this point: loading 512 images took ~40 seconds; 1024 images took ~600 seconds.
  • This is definitely super-linear. Are we polling too much? Is the event queue just getting flooded?
  • Removing the next-tick work reduced the time to load 512 images to ~7 seconds, and 1024 images to ~15 seconds.
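
For reference, a conceptual TypeScript model of the difference between the blocking and non-blocking invoker styles described above. The real UniffiCallInvoker is C++, so these names and details are illustrative assumptions, not the actual implementation.

```ts
// Conceptual model only; not the real UniffiCallInvoker.
type Job = () => void;

// Stand-in for "add a callback call to the event queue".
const enqueue = (job: Job): void => {
  setTimeout(job, 0);
};

// invokeNonBlocking: enqueue the job and return immediately.
function invokeNonBlocking(job: Job): void {
  enqueue(job);
}

// invokeBlocking (Promise flavour): enqueue the job, and give the caller a
// promise to await until the job has actually run on the JS thread.
function invokeBlocking(job: Job): Promise<void> {
  return new Promise((resolve) => {
    enqueue(() => {
      job();
      resolve();
    });
  });
}
```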

I think I am ready to declare victory on this now.

In the end, I think only faf0f89 is needed here. There are almost no changes in async-rust-call.ts.

I'm going to ask @zzorba for another review.

@jhugman jhugman requested a review from zzorba September 17, 2024 16:44
zzorba (Collaborator) left a comment

LGTM, confirmed the fix in the full app; it's working great.

@jhugman jhugman merged commit 404b003 into main Sep 17, 2024
1 check passed
@jhugman jhugman deleted the jhugman/fixup-async-deadlock branch September 17, 2024 17:53
jhugman added a commit that referenced this pull request Nov 11, 2024
The change here is in `UniffiCallInvoker`, reverting the implementation of `invokeBlocking` back to the Promise-based implementation introduced in PR #88.

The deadlock rate for the test is now 0/50.

Adding any more to the implementation (e.g. uncommenting the `console.log`) causes the same error as before, but it is completely deterministic:

```sh
-- Going to 1024, now at: 264
-- Going to 1024, now at: 265
Assertion failed: (usedDbg_ && "Stats not submitted."), function ~CollectionStats, file HadesGC.cpp, line 259.
```

On my machine it _always_ stops after 265/1024. I am still unsure if this is a problem with production uniffi-bindgen-react-native code, an artifact of the test-harness or something weirder.

I'm going to suggest that this is addressed in a separate issue.
jhugman added a commit that referenced this pull request Nov 12, 2024
Fixes #157

Minimal test: 146cd39

The change here is in `UniffiCallInvoker`, reverting the implementation of `invokeBlocking` back to the Promise-based implementation introduced in PR #88.

The deadlock rate for the test is now 0/50.

/cc @tayrevor

Adding any more to the implementation (e.g. uncommenting the `console.log`) causes the same error as before, but it is completely deterministic:

```sh
-- Going to 1024, now at: 264
-- Going to 1024, now at: 265
Assertion failed: (usedDbg_ && "Stats not submitted."), function ~CollectionStats, file HadesGC.cpp, line 259.
```

On my machine it _always_ stops after 265/1024. I am still unsure if this is a problem with production uniffi-bindgen-react-native code, an artifact of the test-harness or something weirder.

I'm going to suggest that this is addressed in a separate issue.