[Bug] Deadlock with rayon usage #3063

Open
HarukaMa opened this issue Feb 7, 2024 · 7 comments
Labels
bug Incorrect or unexpected behavior

Comments

@HarukaMa
Contributor

HarukaMa commented Feb 7, 2024

🐛 Bug Report

There is a rayon-related deadlock in snarkOS, but I'm not quite sure which situation it actually is:

  1. Using rayon parallel iterators while holding a Mutex or write RwLock (this case). See multiple discussions like this and this.
  2. Using rayon with blocking calls (not sure if spawn_blocking applies here). Maybe see this or this.

I think it's probably the first one: in a core dump taken during the deadlock, I saw a write lock being held while the node was stuck waiting on a read lock. Here is the full backtrace of all threads. (It's a large text file, as rayon tends to generate deep stacks. The file is actually .7z but had to be named .zip to upload here.) Notice that thread 69 holds the write lock on vm.process while trying to advance a block, while many other threads are trying to validate incoming unconfirmed transactions and need a read lock.
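If it is the first situation, one generic mitigation is to finish the parallel work before taking the write lock, so the lock is never held while the thread waits on other workers. Below is a minimal sketch of that shape, not snarkOS code: `Process`, `execute`, `advance_block`, and `is_applied` are hypothetical names, and `std::thread::scope` stands in for rayon's parallel iterators only to keep the example dependency-free.

```rust
use std::sync::RwLock;
use std::thread;

/// Hypothetical stand-in for shared node state such as `vm.process`.
struct Process {
    applied: Vec<u64>,
}

/// Hypothetical stand-in for an expensive per-transaction computation.
fn execute(tx: u64) -> u64 {
    tx * 2
}

/// Runs the parallel work with no lock held, then takes the write lock
/// only for the short, sequential commit step.
fn advance_block(process: &RwLock<Process>, txs: &[u64]) {
    // Split the input into at most four chunks; `max(1)` avoids a zero chunk size.
    let chunk_len = ((txs.len() + 3) / 4).max(1);

    // Parallel phase: no lock is held here, so readers are never blocked
    // behind a writer that is itself waiting on worker threads.
    let results: Vec<u64> = thread::scope(|s| {
        let handles: Vec<_> = txs
            .chunks(chunk_len)
            .map(|chunk| {
                s.spawn(move || chunk.iter().map(|&tx| execute(tx)).collect::<Vec<u64>>())
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().unwrap())
            .collect()
    });

    // Commit phase: the write lock is held only for this cheap update.
    let mut guard = process.write().unwrap();
    guard.applied.extend(results);
}

/// Readers (for example, transaction validation) hold the read lock briefly.
fn is_applied(process: &RwLock<Process>, value: u64) -> bool {
    process.read().unwrap().applied.contains(&value)
}
```

Calling `advance_block(&process, &[1, 2, 3])` on a `RwLock<Process>` appends the computed values in input order, and `is_applied` can run concurrently from other threads without waiting behind the parallel phase. Whether this restructuring applies to the actual code path depends on whether the block-advance work really needs the write lock for its whole duration, which I have not confirmed.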

Steps to Reproduce

Not sure. Run the node with a large number of connections?

Expected Behavior

The node should not deadlock.

Your Environment

@HarukaMa HarukaMa added the bug Incorrect or unexpected behavior label Feb 7, 2024
@ljedrz
Collaborator

ljedrz commented Feb 8, 2024

This one feels like it's going to be tricky, but I'll try to investigate it soon.

@raychu86
Contributor

We did some initial passes but were unable to reproduce this. Putting this at a lower priority, but we will keep an eye out and revisit it.

@ljedrz
Collaborator

ljedrz commented Jun 17, 2024

@HarukaMa I've prepared a branch that's aimed at detecting deadlocks; could you try it out with one of your nodes under a workload that's likely to cause a stall, and then provide me with some of its latest logs?

@vicsn
Contributor

vicsn commented Sep 5, 2024

Experienced another validator deadlock on a low-resourced test network that was being spammed with transactions and deployments. The evidence that it was a deadlock is that the validator's process would not terminate after being sent a SIGTERM.

@raychu86
Contributor

Is this still an issue after #3321?

@HarukaMa
Contributor Author

Personally, I haven't seen this in months, now that several improvement PRs have been merged. This can be closed if the comment above from vicsn is not a concern.

@vicsn
Contributor

vicsn commented Nov 20, 2024

I'm in favour of closing. This issue can still be found in the GitHub history for context if a deadlock appears again.
