Add server-side memory of `ComputeServiceRegistrations` that consistently fail jobs? #258

dotsdl · 2024-03-15T23:08:53Z

When a ComputeService is deployed to a problematic compute node, ProtocolDAGs executed on that node can fail randomly or systematically. This can swiftly exhaust the Tasks available on the server, as the ComputeService claims and errors out on Tasks in quick succession, leaving healthy ComputeServices idle.

One mitigation for this is to implement short-term memory within or associated with ComputeServiceRegistrations server-side. As a ComputeService submits completed or errored ProtocolDAGResults to the server, each completion or error could be recorded by appending 1 or -1, respectively, to a growing list of values.
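A minimal sketch of this bookkeeping, for discussion only. The `OutcomeHistory` class, its `record` method, and the `maxlen` bound are hypothetical names and choices, not part of alchemiscale; the bound is an assumption added so that the list keeps only recent results rather than growing indefinitely.

```python
from collections import deque


class OutcomeHistory:
    """Hypothetical per-registration record of recent result outcomes."""

    def __init__(self, maxlen=20):
        # assumption: keep only the most recent `maxlen` outcomes;
        # the oldest value is dropped automatically once the deque is full
        self.values = deque(maxlen=maxlen)

    def record(self, ok):
        # +1 for a completed ProtocolDAGResult, -1 for an errored one
        self.values.append(1 if ok else -1)


history = OutcomeHistory(maxlen=3)
for ok in (True, False, False, True):
    history.record(ok)
```

With `maxlen=3`, the first outcome falls off once the fourth is recorded, so the memory stays short-term.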

This list can then be evaluated server-side when the ComputeService attempts to claim new Tasks, perhaps as a weighted sum that places higher weights on the most recent values and lower weights on older ones. If the resulting sum is negative, the ComputeService may be denied further claims until an expiry time is reached, set as (datetime_denied_attempt() + expiry_seconds), with expiry_seconds configurable as part of the AlchemiscaleComputeAPI config. On its first attempt after expiry, it would be allowed to claim Tasks again to redeem itself.
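One way the claim check could look, as a sketch under stated assumptions rather than a proposed implementation. `weighted_score`, `ClaimGate`, `may_claim`, and the geometric `decay` weighting are all hypothetical; the issue only asks for "higher weights on the most recent values", and exponential decay is one possible choice.

```python
from datetime import datetime, timedelta, timezone


def weighted_score(values, decay=0.8):
    """Weighted sum of outcomes given oldest-first.

    The newest value has weight 1; each step back in time
    multiplies the weight by `decay` (assumed weighting scheme).
    """
    n = len(values)
    return sum(v * decay ** (n - 1 - i) for i, v in enumerate(values))


class ClaimGate:
    """Hypothetical gate deciding whether a registration may claim Tasks."""

    def __init__(self, expiry_seconds=300.0, decay=0.8):
        self.expiry_seconds = expiry_seconds
        self.decay = decay
        self.denied_until = None

    def may_claim(self, values, now=None):
        if now is None:
            now = datetime.now(timezone.utc)
        if self.denied_until is not None:
            if now < self.denied_until:
                return False  # still within the denial window
            # first attempt after expiry: allow it regardless of score
            self.denied_until = None
            return True
        if weighted_score(values, self.decay) < 0:
            # deny, and remember when claims may resume
            self.denied_until = now + timedelta(seconds=self.expiry_seconds)
            return False
        return True
```

Note that in this sketch a single success after redemption may not be enough to bring the score back above zero, so the service could be denied again on its next attempt. Whether the history should be cleared or down-weighted on expiry is left open here.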

This should slow down task exhaustion substantially, while also giving ComputeServices a chance to recover from temporary issues, such as transient high load on a shared resource.

dotsdl added component-compute-api component-compute-service component-compute-client labels Mar 15, 2024

dotsdl added this to the Release 0.6.0 - automated strategy execution milestone Mar 15, 2024

dotsdl modified the milestones: Release 0.6.0 - automated strategy execution, Release 0.5.0 - "living networks" and automated strategies enablement Apr 23, 2024

dotsdl mentioned this issue Jun 5, 2024

Add user-settable server-side Task restart policy, per-AlchemicalNetwork #277

Open
