
Possible data race in blocking_get{_for} #397

Open
drussel opened this issue Apr 25, 2022 · 10 comments


drussel commented Apr 25, 2022

Some of our tests have been occasionally hanging in stlab::blocking_get when called on futures that are (often) already ready. Perhaps this is due to a race on the call to notify_one. In particular, if the notify_one call is made before the outer thread waits on the condition variable, then the thread will not get unblocked. Does this make sense?

That said, I would expect that patching blocking_get to check the flag while holding the lock before waiting on the condition variable would fix this, but it does not. So clearly I don't entirely understand the problem.

Going back to the simpler check/sleep implementation of blocking_get does work, and since we only use it in test code, that is just fine for us.

stlab versions 1.6.x, with GCC, Clang, and VC++.

drussel commented Apr 25, 2022

It looks like wait should check the condition before waiting, so my theory is not correct.

@sean-parent (Member)

Are you using the portable executor? There is a potential deadlock with blocking_get() in that case. Suppose you have tasks a, b, and c, where a waits on b and b waits on c.

If b starts first, it will wait on c and might steal task a; a will then wait on b and steal task c, but now a and b are deadlocked. I'm going to rework (or have a new member of my team rework) the portable task system and blocking_get() to allow spinning up additional threads; this is what GCD and the Windows Thread Pool do in this situation, but only up to a limit. blocking_get() is never guaranteed to make progress unless it is invoked from a thread outside of any thread where dependent tasks are scheduled.

drussel commented Apr 26, 2022

We aren't using the portable executor (we use tbb and have a custom executor mapping onto tbb workers), which makes the code path much simpler and makes it all the more perplexing that it sometimes hangs.

On another note, is there any documentation of the portable executor? I couldn't find any.

@sean-parent (Member)

The tbb thread pool doesn't scale, so if you block all the threads in the pool you will deadlock. This is surprisingly easy to do. TBB also can't wake a thread to steal, and I don't know how their stealing algorithm works, so if you block you may end up waiting on a task scheduled on the same queue.

drussel commented Apr 27, 2022

Yes, it looks like it might be something in that direction. The thread being blocked is the main thread, which hasn't made any tbb parallel calls, so we didn't think it was problematic in terms of deadlocking. But after adding more logging, the future's work gets stuck somewhere between being submitted to the tbb task group and being executed.

BTW, our motivation for sending the future's work to the tbb thread pool is that we have a lot of underlying code that uses tbb parallel constructs, and we want to be able to mix that with futures-based code without oversubscribing.

drussel commented Apr 27, 2022

Anyway, while I'm still flummoxed, it doesn't appear likely to be due to stlab code; it is perhaps more likely due to our not understanding what tbb is doing underneath.

@sean-parent (Member)

If you can distill down the basics of what you are doing with the futures I can take another look...

drussel commented May 4, 2022

We just hit the same issue with the check/sleep version of blocking_get. So while it is dramatically less likely there, it is still present. I'll see if I can distill things down to something simpler that reproduces it.

@dabrahams (Contributor)

@drussel How's that distillation coming along?

drussel commented Jun 28, 2022

I made an attempt that failed to reproduce it, and haven't had a chance to spend more time on it since, sorry.
