
Fixup deadlock when hammering futures of a certain type #88

Merged
merged 3 commits into main on Sep 17, 2024

Conversation

jhugman (Owner) commented Sep 5, 2024

According to The Big O of Code Reviews, this is an O(n) change.

This PR moves the polling of futures outside of the continuation callback; polling inside the callback has been shown to be a source of deadlocks.

      // From https://github.com/mozilla/uniffi-rs/pull/1837/files#diff-8a28c9cf1245b4f714d406ea4044d68e1000099928eaca1afb504ccbc008fe9fR35-R37
      //
      // > WARNING: the call to [rust_future_poll] must be scheduled to happen soon after the callback is
      // > called, but not inside the callback itself.  If [rust_future_poll] is called inside the
      // > callback, some futures will deadlock and our scheduler code might as well.
      //
      // This delay is to ensure that `uniffiFutureContinuationCallback` returns before the next poll, i.e.
      // so that the next poll is outside of this callback.
      //
      // The length of the delay seems to be significant (at least in tests which hammer a network).
      // I would like to understand this more: I am still seeing deadlocks when this drops below its current
      // delay, but these may be related to a different issue, as alluded to in
      // https://github.com/mozilla/uniffi-rs/pull/1901

Unfortunately, the tests for this aren't likely to end up in this repo, since they rely on a fork of the matrix-sdk.

As noted in the comment, I don't think this has fixed all possible deadlocks, but I'm hoping it has at least fixed the proximal problem.
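
To make the shape of the change concrete, here is a minimal TypeScript sketch, assuming illustrative names and constants (`continuationCallback`, `pollAgain`, `MAYBE_READY` are stand-ins, not the generated bindings API): the continuation callback no longer polls the future itself; it only schedules the next poll and returns.

```ts
// Illustrative sketch only: names, signature and constants are assumptions,
// not the real generated bindings.
const MAYBE_READY = 0;     // assumed poll-result code, for illustration
const POLL_DELAY_MS = 100; // the delay discussed in the comment above

function continuationCallback(pollResult: number, pollAgain: () => void): void {
  if (pollResult === MAYBE_READY) {
    // Schedule the next poll so it runs *after* this callback has returned,
    // rather than calling pollAgain() here directly.
    setTimeout(pollAgain, POLL_DELAY_MS);
  }
  // Returning promptly lets the Rust side finish its poll before the next one
  // starts.
}
```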

Fixes #89

@jhugman jhugman requested a review from zzorba September 5, 2024 16:18
@jhugman jhugman force-pushed the jhugman/fixup-async-deadlock branch 4 times, most recently from f657971 to 67eb94d on September 5, 2024 17:10
zzorba (Collaborator) commented Sep 5, 2024

Thank you for the fix, it does work on our branch and the deadlock is gone now.

Though the fact that 100ms is the threshold at which it starts working makes me worry a little that we've just worked around the problem?

Does the extra delay need to be applied to every poll, or is it the first one that matters most? (I thought the uniffi issue was saying that it was related to setup + poll being too close to each other.)

zzorba (Collaborator) left a comment

Approve as a viable workaround, though I'd love to better understand the 'why' if possible.

jhugman (Owner, Author) commented Sep 6, 2024

My understanding is that the poll is protected by a mutex.

A call to the poll on the Rust side:

  • takes the mutex lock. This is not a re-entrant mutex.
  • does some work
  • calls into the JS uniffiFutureContinuationCallback with an update on the status of the task.
  • then releases the mutex lock.

Before this change, the uniffiFutureContinuationCallback would:

  • resolve the promise that the poll loop is awaiting on.
  • the polling loop then either ends and the completeFunc is called, OR:
  • it calls back into the Rust poll, which then tries to acquire the lock that is still held.
  • because that lock is held, the poll blocks, so uniffiFutureContinuationCallback never returns to release it: a deadlock.

The fix makes the work done in uniffiFutureContinuationCallback just a setTimeout to schedule the next poll, instead of the poll itself. The call to setTimeout returns immediately, so uniffiFutureContinuationCallback can return and unlock the mutex before the next poll.
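
Here is a small conceptual model of that sequence in TypeScript. This is not the real uniffi code: a real Rust mutex would block rather than throw, and the names are made up; the throw is only there to make the re-entrant acquisition visible.

```ts
// A toy model of the deadlock, NOT the real uniffi internals.
class NonReentrantLock {
  private held = false;
  acquire(): void {
    if (this.held) throw new Error("deadlock: lock is already held");
    this.held = true;
  }
  release(): void {
    this.held = false;
  }
}

const schedulerLock = new NonReentrantLock();

// Stand-in for the Rust-side poll: it holds the lock while it invokes the JS
// continuation callback, and only releases it after the callback returns.
function rustSidePoll(continuation: () => void): void {
  schedulerLock.acquire();
  continuation();
  schedulerLock.release();
}

// Before the fix: the callback polls again while the lock is still held.
// rustSidePoll(() => rustSidePoll(() => {})); // throws (i.e. would deadlock)

// After the fix: the callback only *schedules* the next poll, so it returns
// (and the lock is released) before the next poll runs.
rustSidePoll(() => setTimeout(() => rustSidePoll(() => {}), 0));
```

The last line is the shape of the fix: because the next poll is only scheduled, the callback returns and the lock is released before the poll actually runs.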

This analysis suggests that the deadlock could be reliably reproduced by setting up a situation where the pollResult is MAYBE_READY.

What I can't explain, if the above is true, but can hazard a guess at:

  • what is the significance of the 100ms? Why can't this be 0, i.e. just use setImmediate?

At this point, it's worth pointing out:

  • the test-harness implements a rudimentary event loop. There are setTimeout and friends, which I lifted from @tmikov's excellent hermes-jsi-demos. On top of that, I have implemented a CallInvoker which stands in for the react::CallInvoker (a rough sketch of the idea follows this list). I have not verified that the behaviour of the event loop/microtasks etc. is standards-compliant, or matches how react-native does it.
  • the build process is emitting very old-style JS. It's still using Metro, but there is a tsc step before it. This would suggest that the JS being executed in test may be quite different from the JS running on device. Is this a problem? I don't think so, but it is a source of code difference, so perhaps.
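
For what it's worth, here is a rough TypeScript sketch of the kind of queue the harness's CallInvoker boils down to. The real one is C++ built on top of hermes-jsi-demos; the names and behaviour here are assumptions for illustration only.

```ts
// A rough model of a task queue standing in for react::CallInvoker.
type Task = () => void;

class FakeCallInvoker {
  private queue: Task[] = [];

  // Enqueue a task to run on a later turn of the loop (cf. invokeAsync).
  invokeAsync(task: Task): void {
    this.queue.push(task);
  }

  // Stand-in for the event loop pumping queued tasks in FIFO order.
  drain(): void {
    while (this.queue.length > 0) {
      this.queue.shift()!();
    }
  }
}

// Usage: schedule two tasks, then pump the queue.
const invoker = new FakeCallInvoker();
invoker.invokeAsync(() => console.log("first"));
invoker.invokeAsync(() => console.log("second"));
invoker.drain();
```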

@zzorba Would you be able to change the 100 to 1 or zero, and try it out within React Native?

What still confuses me, if the above is true:

  • why don't we see this every time a MAYBE_READY is returned?
  • why do we still see the occasional deadlock during test? Will we see the occasional deadlock on device? Are there other deadlocks lurking in this code?

Eyeballing uniffi's future.rs suggests there are two Mutexes per future. The principal one we've been discussing is the scheduler one, but perhaps there is some ordering of locking and unlocking that the JS code is getting into. More research needed.

@jhugman jhugman force-pushed the jhugman/fixup-async-deadlock branch 4 times, most recently from 422bd95 to d0b5904 on September 16, 2024 18:53
@jhugman jhugman force-pushed the jhugman/fixup-async-deadlock branch from d0b5904 to 86a80c1 on September 17, 2024 16:29
jhugman (Owner, Author) commented Sep 17, 2024

The continuation callback needed to be called and done by the time the next poll happened.

Tests for this are in #89. The test loads the same image 2, 4, 8, 16, 32, 64, 128, 256, 512, 512 times.
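
The shape of that test, sketched in TypeScript with a hypothetical `loadImage` function (the real test lives in the matrix-sdk fork, see #89):

```ts
// Hedged sketch of the hammer test described above. `loadImage` is a
// hypothetical stand-in for the call exercised by the real test; the doubling
// counts follow the sequence in the description.
async function hammer(loadImage: () => Promise<unknown>): Promise<void> {
  for (const count of [2, 4, 8, 16, 32, 64, 128, 256, 512]) {
    const started = Date.now();
    // Fire `count` concurrent loads of the same image and wait for them all.
    await Promise.all(Array.from({ length: count }, () => loadImage()));
    console.log(`loaded ${count} images in ${Date.now() - started} ms`);
  }
}
```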

Things I tried:

  • I rewrote UniffiCallInvoker::invokeBlocking to use lighter-weight primitives. This increased the completion rate, but I was still seeing deadlocks in about 3 out of 7 runs.
  • I tried scheduling the next poll onto the next tick of the event loop. Again, this increased the success rate, to about 5 out of 10 runs.
  • I added a UniffiCallInvoker::invokeNonBlocking which just adds a callback call to the event queue (see the sketch after this list). This increased the success rate to 10/10. I added timings at this point: loading 512 images took ~40 seconds; 1024 images took ~600 seconds.
  • This is definitely super-linear. Are we polling too much? Is the event queue just getting flooded?
  • Removing the next-tick work reduced the time to load 512 images to ~7 seconds, and 1024 images to ~15 seconds.
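
For reference, a conceptual TypeScript model of the difference between the blocking and non-blocking invoker styles described above. The real UniffiCallInvoker is C++, so these names and details are illustrative assumptions, not the actual implementation.

```ts
// Conceptual model only; not the real UniffiCallInvoker.
type Job = () => void;

// Stand-in for "add a callback call to the event queue".
const enqueue = (job: Job): void => {
  setTimeout(job, 0);
};

// invokeNonBlocking: enqueue the job and return immediately.
function invokeNonBlocking(job: Job): void {
  enqueue(job);
}

// invokeBlocking (Promise flavour): enqueue the job, and give the caller a
// promise to await until the job has actually run on the JS thread.
function invokeBlocking(job: Job): Promise<void> {
  return new Promise((resolve) => {
    enqueue(() => {
      job();
      resolve();
    });
  });
}
```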

I think I am ready to declare victory on this now.

In the end, I think only faf0f89 is needed here. There are almost no changes in async-rust-call.ts.

I'm going to ask @zzorba for another review.

@jhugman jhugman requested a review from zzorba September 17, 2024 16:44
zzorba (Collaborator) left a comment

LGTM, confirmed the fix in the full app; it's working great.

@jhugman jhugman merged commit 404b003 into main Sep 17, 2024
1 check passed
@jhugman jhugman deleted the jhugman/fixup-async-deadlock branch September 17, 2024 17:53
jhugman added a commit that referenced this pull request Nov 11, 2024
The change here is in `UniffiCallInvoker`, reverting the implementation of `invokeBlocking` back to the Promise-based implementation introduced in PR #88.

The deadlock rate for the test is now 0/50.

Adding any more to the implementation (e.g. uncommenting the `console.log`) causes the same error as before, but it is completely deterministic:

```sh
-- Going to 1024, now at: 264
-- Going to 1024, now at: 265
Assertion failed: (usedDbg_ && "Stats not submitted."), function ~CollectionStats, file HadesGC.cpp, line 259.
```

On my machine it _always_ stops after 265/1024. I am still unsure if this is a problem with production uniffi-bindgen-react-native code, an artifact of the test-harness or something weirder.

I'm going to suggest that this is addressed in a separate issue.
jhugman added a commit that referenced this pull request Nov 12, 2024
Fixes #157

Minimal test: 146cd39

The change here is in `UniffiCallInvoker`, reverting the implementation of `invokeBlocking` back to the Promise-based implementation introduced in PR #88.

The deadlock rate for the test is now 0/50.

/cc @tayrevor

Adding any more to the implementation (e.g. uncommenting the `console.log`) causes the same error as before, but it is completely deterministic:

```sh
-- Going to 1024, now at: 264
-- Going to 1024, now at: 265
Assertion failed: (usedDbg_ && "Stats not submitted."), function ~CollectionStats, file HadesGC.cpp, line 259.
```

On my machine it _always_ stops after 265/1024. I am still unsure if this is a problem with production uniffi-bindgen-react-native code, an artifact of the test-harness or something weirder.

I'm going to suggest that this is addressed in a separate issue.