Service deadlock detection triggers when there is no service deadlock #1509
Thanks for reporting this - @benedictpaten might have better insight, but I don't think this is intentional.
I wonder if this is related to the test failing from service deadlock in BD2KGenomics/toil-lib#78.
@jvivian I am not a betting man, but I would take that wager, yes.
Given the timing for that test failure (failure almost exactly 1 minute after the service job was issued), I think it's probably slightly different:
What I notice with my issue is a deadlock exception after less than a second in a situation where a service job is the only job issued (because of a bug in the deadlock detector). From reading that test log, it seems to me like the Spark service may have taken more than 60 seconds (the default deadlock time) to start up, for any number of reasons (#1503?). Or maybe it's caused by a different bug in the service manager/deadlock detection?
@joelarmstrong good call. The time-based deadlock detection is inherently racy, so I'll look into better ways to do it, but for now it might make sense just to bump the default deadlock time.
@joelarmstrong just checking in on this - are you working on a PR for this issue?
…ck-detection Fix service deadlock detection (resolves #1509)
The only state the service deadlock detector keeps is a) the service jobs that were issued during the last "potential deadlock" and b) the time that "potential deadlock" occurred.
But the deadlock detector doesn't reset the "potential deadlock" information after it knows a deadlock didn't occur. Here's the situation I run into:
The issue is that, between batch B completing and batch C getting issued, even though this may take only a fraction of a second, the service deadlock code can run. If it runs at just the right time, it detects that service job A is the only job issued, just like the last "potential deadlock", and terminates the workflow because it's been an hour since the last "potential deadlock".
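To make the bug concrete, here's a minimal sketch of that style of time-based detection. All names here are hypothetical (this is not Toil's actual leader code); the point is the control flow: if the "potential deadlock" state is never reset when other jobs show up, a later check that happens to see the same service job as the only issued job inherits a stale timestamp and declares a deadlock immediately.

```python
import time

class ServiceDeadlockDetector:
    """Illustrative sketch of time-based service deadlock detection.

    Hypothetical names, not Toil's real implementation; shows why the
    "potential deadlock" state must be cleared once other jobs appear.
    """

    def __init__(self, deadlock_wait=3600):
        self.deadlock_wait = deadlock_wait   # seconds before declaring a deadlock
        self.potential_deadlock_jobs = None  # service jobs seen at the last check
        self.potential_deadlock_time = None  # when that state was first seen

    def check(self, issued_service_jobs, issued_other_jobs, now=None):
        """Return True if a service deadlock should be declared."""
        now = time.time() if now is None else now
        if issued_other_jobs or not issued_service_jobs:
            # The fix: this reset was missing in the buggy version, so a
            # later check could compare against stale state and conclude
            # the "deadlock" had persisted for the full deadlock_wait.
            self.potential_deadlock_jobs = None
            self.potential_deadlock_time = None
            return False
        if issued_service_jobs == self.potential_deadlock_jobs:
            # Same services are still the only issued jobs: a deadlock
            # only if this situation has persisted long enough.
            return now - self.potential_deadlock_time >= self.deadlock_wait
        # New potential deadlock: remember the state and start the clock.
        self.potential_deadlock_jobs = issued_service_jobs
        self.potential_deadlock_time = now
        return False
```

With the reset in place, the fraction of a second between batch B completing and batch C being issued restarts the clock instead of counting against the hour-old timestamp.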
I have a fix for this, but I want to double-check that I actually understand the purpose of the service deadlock detection. As of right now I don't think it's too useful, because it works solely on the basis of issued jobs. Consider the situation where two cores are available, and 2 service jobs and 1 regular job are issued, but only the 2 service jobs are running. No forward progress can be made, but currently the code wouldn't detect this as a "service deadlock". Is that intentional?
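The issued-jobs-only limitation can be sketched in a few lines. Again these are hypothetical names for illustration, not Toil's API: a check over issued jobs sees the regular job and reports no deadlock, while a check over jobs actually running on the two cores would catch the stall.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Job:
    name: str
    is_service: bool

def only_services_issued(issued):
    """The current style of check: looks only at issued jobs."""
    return bool(issued) and all(j.is_service for j in issued)

def only_services_running(running):
    """A stricter check: are service jobs the only ones occupying cores?"""
    return bool(running) and all(j.is_service for j in running)

# Two cores, both occupied by service jobs; a regular job is issued
# but cannot start, so no forward progress is possible.
issued = [Job("svc1", True), Job("svc2", True), Job("worker", False)]
running = [Job("svc1", True), Job("svc2", True)]
```

Here `only_services_issued(issued)` is False (the stall goes undetected, because the non-service job is issued), while `only_services_running(running)` is True.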