
Service deadlock detection triggers when there is no service deadlock #1509

Closed
joelarmstrong opened this issue Feb 13, 2017 · 6 comments

@joelarmstrong
Contributor

The only state the service deadlock detector keeps is a) the set of service jobs that were issued during the last "potential deadlock" and b) the time that "potential deadlock" occurred.

But the deadlock detector doesn't reset the "potential deadlock" information after it knows a deadlock didn't occur. Here's the situation I run into:

  • Service job A gets issued. While it's starting (for, say, 5 seconds), service job A is the only job issued, so the "potential deadlock" code gets triggered.
  • A bunch of jobs B get issued that depend on service A. This takes, say, an hour.
  • The last job of batch B gets completed.
  • A bunch of jobs C get issued.

The issue is that, between batch B completing and batch C getting issued, even though this may take only a fraction of a second, the service deadlock code can run. If it runs at just the right time, it detects that service job A is the only job issued, just like the last "potential deadlock", and terminates the workflow because it's been an hour since the last "potential deadlock".
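To make the race concrete, here's a minimal sketch of the behaviour described above (the names are made up for illustration; this is not the actual leader code):

```python
import time

class DeadlockDetector:
    """Minimal sketch of the described behaviour; all names are hypothetical."""

    def __init__(self, deadlock_wait=60):
        self.deadlock_wait = deadlock_wait       # seconds before declaring a deadlock
        self.potential_deadlock_jobs = None      # service jobs seen at the last "potential deadlock"
        self.potential_deadlock_time = None      # when that "potential deadlock" was first seen

    def check(self, issued_jobs, issued_service_jobs):
        """Called periodically by the leader loop with sets of issued job IDs."""
        if issued_jobs and issued_jobs == issued_service_jobs:
            # Only service jobs are issued: a "potential deadlock".
            if self.potential_deadlock_jobs == issued_service_jobs:
                # BUG: potential_deadlock_time may be an hour old, because the
                # state was never reset while non-service jobs were issued in
                # between. A sub-second service-only window then looks like a
                # deadlock that has lasted the whole hour.
                if time.time() - self.potential_deadlock_time > self.deadlock_wait:
                    raise RuntimeError("Apparent service deadlock")
            else:
                self.potential_deadlock_jobs = issued_service_jobs
                self.potential_deadlock_time = time.time()
        # Missing: an else branch that clears the state once other jobs are
        # issued again, i.e. once we know the "potential deadlock" wasn't real.
```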

I have a fix for this, but I want to double-check that I actually understand the purpose of the service deadlock detection. As of right now I don't think it's too useful, because it works solely on the basis of issued jobs. Consider the situation where two cores are available, and 2 service jobs and 1 regular job are issued, but only the 2 service jobs are running. No forward progress can be made, but currently the code wouldn't detect this as a "service deadlock". Is that intentional?
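One way to avoid the race would be to forget the earlier "potential deadlock" as soon as the service-only condition stops holding, so the timer only measures contiguous service-only time. A sketch in the same hypothetical terms as above (not necessarily what the eventual fix will look like):

```python
    def check(self, issued_jobs, issued_service_jobs):
        if issued_jobs and issued_jobs == issued_service_jobs:
            if self.potential_deadlock_jobs == issued_service_jobs:
                if time.time() - self.potential_deadlock_time > self.deadlock_wait:
                    raise RuntimeError("Apparent service deadlock")
            else:
                self.potential_deadlock_jobs = issued_service_jobs
                self.potential_deadlock_time = time.time()
        else:
            # Other jobs are issued again, so the earlier "potential deadlock"
            # clearly wasn't one: forget it rather than letting the old
            # timestamp linger.
            self.potential_deadlock_jobs = None
            self.potential_deadlock_time = None
```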

@joelarmstrong joelarmstrong self-assigned this Feb 13, 2017
@cket
Contributor

cket commented Feb 14, 2017

Thanks for reporting this - @benedictpaten might have better insight but I don't think this is intentional

@jvivian
Contributor

jvivian commented Feb 14, 2017

I wonder if this is related to the test failing from service deadlock in BD2KGenomics/toil-lib#78

@fnothaft
Contributor

@jvivian I am not a betting man, but I would take that wager, yes.

@joelarmstrong
Contributor Author

Given the timing for that test failure (failure almost exactly 1 minute after service job issued), I think it's probably slightly different:

ip-172-31-23-244 2017-01-20 09:21:00,049 MainThread INFO toil.leader: Issued job 'SparkService' e/D/jobQeeZfK with job batch system ID: 1 and cores: 1, disk: 2.0 G, and memory: 2.0 G
[1 minute later]
ip-172-31-23-244 2017-01-20 09:22:02,057 MainThread INFO toil.serviceManager: Waiting for service manager thread to finish ...
ip-172-31-23-244 2017-01-20 09:22:04,166 MainThread INFO toil.serviceManager: ... finished shutting down the service manager. Took 2.10873413086 seconds

What I see with my issue is a deadlock exception after less than a second in a state where a service job is the only job issued (because of the bug in the deadlock detector).

From reading that test log it seems to me like the Spark service may have taken more than 60 seconds (the default deadlock time) to start up, for any number of reasons (#1503?). Or maybe it's caused by a different bug in the service manager/deadlock detection?

@cket
Contributor

cket commented Feb 14, 2017

@joelarmstrong good call. The time-based deadlock detection is inherently racy, so I'll look into better ways to do it, but for now it might make sense to just bump the default deadlock time.
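For anyone hitting this in the meantime, bumping the deadlock wait would look roughly like the sketch below. Note that `deadlockWait` is an assumption about the name of the option controlling the service-deadlock timeout, so check the option list for your Toil version before relying on it:

```python
from toil.job import Job

# Sketch only: `deadlockWait` is an assumed option name for the
# service-deadlock timeout (in seconds); verify the exact flag/attribute
# against your Toil version (the CLI flag would then be --deadlockWait 300).
options = Job.Runner.getDefaultOptions("./jobStore")
options.deadlockWait = 300  # e.g. 5 minutes instead of the 60-second default
```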

@cket
Contributor

cket commented Feb 24, 2017

@joelarmstrong just checking in on this - are you working on a PR for this issue?

@ghost ghost added the in progress label Mar 1, 2017
@cket cket closed this as completed in 755e253 Mar 7, 2017
cket added a commit that referenced this issue Mar 7, 2017

Fix service deadlock detection (resolves #1509)
@ghost ghost removed the in progress label Mar 7, 2017
adderan pushed a commit to adderan/toil that referenced this issue Mar 30, 2017