Add server-side memory of ComputeServiceRegistrations that consistently fail jobs? #258
When a `ComputeService` is deployed to a problematic compute node, this can cause random or systematic failures of `ProtocolDAG`s executed on that node. This can swiftly result in `Task` exhaustion from the server, as the `ComputeService` consumes and errors out on `Task`s in quick succession, leaving healthy `ComputeService`s to idle.

One mitigation for this is to implement short-term memory within or associated with `ComputeServiceRegistration`s server-side. As a `ComputeService` submits completed or errored `ProtocolDAGResult`s to the server, the completion or error could be indicated with addition of either `1` or `-1` to a growing list of values.

This list can then be evaluated server-side when the `ComputeService` attempts to claim new `Task`s, perhaps with a weighted sum of values in the list with higher weights on the most recent values and lower weights on the older ones. If the resulting sum is negative, the `ComputeService` may be denied new attempts to claim until some time expiry is reached, configurable as part of the `AlchemiscaleComputeAPI` config, with the `datetime` set as `(datetime_denied_attempt() + expiry_seconds)`. It would then be allowed to claim `Task`s again on its first attempt after expiry to redeem itself.

This should slow down task exhaustion substantially, while also giving `ComputeService`s a chance to recover from temporary issues, such as transient high load on a shared resource.
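
To make the proposal concrete, here is a minimal sketch of the scoring and denial logic, not a definitive implementation. Everything in it is an assumption for illustration: the `RegistrationMemory` class, its `record`, `score`, and `may_claim` methods, the geometric `decay` weighting, and the `max_len` cap do not exist in alchemiscale. It also assumes the denial deadline is set once, at the first denied attempt, and that exactly one claim attempt is let through after expiry.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class RegistrationMemory:
    """Hypothetical short-term memory for one ComputeServiceRegistration."""

    # +1 for a completed ProtocolDAGResult, -1 for an errored one; oldest first.
    outcomes: List[int] = field(default_factory=list)
    # Set when a claim is denied; claims are refused until this time.
    denied_until: Optional[datetime] = None
    # Assumed tuning knobs, not part of the original proposal.
    max_len: int = 20
    decay: float = 0.8

    def record(self, ok: bool) -> None:
        """Append an outcome and keep only the most recent `max_len` values."""
        self.outcomes.append(1 if ok else -1)
        self.outcomes = self.outcomes[-self.max_len:]

    def score(self) -> float:
        """Weighted sum: the newest value has weight 1, older ones decay geometrically."""
        return sum(
            value * self.decay**age
            for age, value in enumerate(reversed(self.outcomes))
        )

    def may_claim(self, now: datetime, expiry_seconds: float) -> bool:
        """Decide whether a claim attempt made at `now` is allowed."""
        if self.denied_until is not None:
            if now < self.denied_until:
                return False
            # Expiry reached: let this one attempt through so the service
            # can redeem itself; later attempts are scored again.
            self.denied_until = None
            return True
        if self.score() < 0:
            # Deadline is (time of the denied attempt + expiry_seconds).
            self.denied_until = now + timedelta(seconds=expiry_seconds)
            return False
        return True
```

With these assumed defaults, a service that records two completions followed by three errors gets a negative score, is denied on its next claim, and is allowed exactly one claim once `expiry_seconds` have passed. If its outcomes have not improved by the following attempt, it is denied again with a fresh deadline.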