
Service deadlock detection triggers when there is no service deadlock #1509

Closed
joelarmstrong opened this issue Feb 13, 2017 · 6 comments

@joelarmstrong
Contributor

The only state the service deadlock detector keeps is a) the set of service jobs that were issued during the last "potential deadlock" and b) the time that "potential deadlock" occurred.

But the deadlock detector doesn't reset the "potential deadlock" information after it knows a deadlock didn't occur. Here's the situation I run into:

  • Service job A gets issued. While it's starting (for, say, 5 seconds), service job A is the only job issued, so the "potential deadlock" code gets triggered.
  • A bunch of jobs B get issued that depend on service A. This takes, say, an hour.
  • The last job of batch B gets completed.
  • A bunch of jobs C get issued.

The issue is that, between batch B completing and batch C getting issued, even though this may take only a fraction of a second, the service deadlock code can run. If it runs at just the right time, it detects that service job A is the only job issued, just like the last "potential deadlock", and terminates the workflow because it's been an hour since the last "potential deadlock".
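To make the race concrete, here's a minimal sketch of the behaviour described above (the names are made up for illustration; this is not the actual leader code):

```python
import time

class DeadlockDetector:
    """Minimal sketch of the described behaviour; all names are hypothetical."""

    def __init__(self, deadlock_wait=60):
        self.deadlock_wait = deadlock_wait       # seconds before declaring a deadlock
        self.potential_deadlock_jobs = None      # service jobs seen at the last "potential deadlock"
        self.potential_deadlock_time = None      # when that "potential deadlock" was first seen

    def check(self, issued_jobs, issued_service_jobs):
        """Called periodically by the leader loop with sets of issued job IDs."""
        if issued_jobs and issued_jobs == issued_service_jobs:
            # Only service jobs are issued: a "potential deadlock".
            if self.potential_deadlock_jobs == issued_service_jobs:
                # BUG: potential_deadlock_time may be an hour old, because the
                # state was never reset while non-service jobs were issued in
                # between. A sub-second service-only window then looks like a
                # deadlock that has lasted the whole hour.
                if time.time() - self.potential_deadlock_time > self.deadlock_wait:
                    raise RuntimeError("Apparent service deadlock")
            else:
                self.potential_deadlock_jobs = issued_service_jobs
                self.potential_deadlock_time = time.time()
        # Missing: an else branch that clears the state once other jobs are
        # issued again, i.e. once we know the "potential deadlock" wasn't real.
```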

I have a fix for this, but I want to double-check that I actually understand the purpose of the service deadlock detection. As of right now I don't think it's too useful, because it works solely on the basis of issued jobs. Consider the situation where two cores are available, and 2 service jobs and 1 regular job are issued, but only the 2 service jobs are running. No forward progress can be made, but currently the code wouldn't detect this as a "service deadlock". Is that intentional?
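One way to avoid the race would be to forget the earlier "potential deadlock" as soon as the service-only condition stops holding, so the timer only measures contiguous service-only time. A sketch in the same hypothetical terms as above (not necessarily what the eventual fix will look like):

```python
    def check(self, issued_jobs, issued_service_jobs):
        if issued_jobs and issued_jobs == issued_service_jobs:
            if self.potential_deadlock_jobs == issued_service_jobs:
                if time.time() - self.potential_deadlock_time > self.deadlock_wait:
                    raise RuntimeError("Apparent service deadlock")
            else:
                self.potential_deadlock_jobs = issued_service_jobs
                self.potential_deadlock_time = time.time()
        else:
            # Other jobs are issued again, so the earlier "potential deadlock"
            # clearly wasn't one: forget it rather than letting the old
            # timestamp linger.
            self.potential_deadlock_jobs = None
            self.potential_deadlock_time = None
```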

@joelarmstrong joelarmstrong self-assigned this Feb 13, 2017
@cket
Contributor

cket commented Feb 14, 2017

Thanks for reporting this - @benedictpaten might have better insight but I don't think this is intentional

@jvivian
Contributor

jvivian commented Feb 14, 2017

I wonder if this is related to the test failing from service deadlock in BD2KGenomics/toil-lib#78

@fnothaft
Contributor

@jvivian I am not a betting man, but I would take that wager, yes.

@joelarmstrong
Contributor Author

Given the timing for that test failure (failure almost exactly 1 minute after service job issued), I think it's probably slightly different:

ip-172-31-23-244 2017-01-20 09:21:00,049 MainThread INFO toil.leader: Issued job 'SparkService' e/D/jobQeeZfK with job batch system ID: 1 and cores: 1, disk: 2.0 G, and memory: 2.0 G
[1 minute later]
ip-172-31-23-244 2017-01-20 09:22:02,057 MainThread INFO toil.serviceManager: Waiting for service manager thread to finish ...
ip-172-31-23-244 2017-01-20 09:22:04,166 MainThread INFO toil.serviceManager: ... finished shutting down the service manager. Took 2.10873413086 seconds

What I see with my issue is a deadlock exception after less than a second in a state where a service job is the only job issued (because of the bug in the deadlock detector).

From reading that test log it seems to me like the Spark service may have taken more than 60 seconds (the default deadlock time) to start up, for any number of reasons (#1503?). Or maybe it's caused by a different bug in the service manager/deadlock detection?

@cket
Contributor

cket commented Feb 14, 2017

@joelarmstrong good call. The time-based deadlock detection is inherently racy, so I'll look into better ways to do it, but for now it might make sense to just bump the default deadlock time.
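For anyone hitting this in the meantime, bumping the deadlock wait would look roughly like the sketch below. Note that `deadlockWait` is an assumption about the name of the option controlling the service-deadlock timeout, so check the option list for your Toil version before relying on it:

```python
from toil.job import Job

# Sketch only: `deadlockWait` is an assumed option name for the
# service-deadlock timeout (in seconds); verify the exact flag/attribute
# against your Toil version (the CLI flag would then be --deadlockWait 300).
options = Job.Runner.getDefaultOptions("./jobStore")
options.deadlockWait = 300  # e.g. 5 minutes instead of the 60-second default
```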

@cket
Contributor

cket commented Feb 24, 2017

@joelarmstrong just checking in on this - are you working on a PR for this issue?

@ghost ghost added the in progress label Mar 1, 2017
@cket cket closed this as completed in 755e253 Mar 7, 2017
cket added a commit that referenced this issue Mar 7, 2017

Fix service deadlock detection (resolves #1509)
@ghost ghost removed the in progress label Mar 7, 2017
adderan pushed a commit to adderan/toil that referenced this issue Mar 30, 2017