Add user-settable server-side `Task` restart policy, per-`AlchemicalNetwork` #277

dotsdl · 2024-06-05T04:11:28Z

Because Tasks are executed by compute services running on disparate resources, random errors will likely affect some fraction of them, typically through a small set of recurring failure modes. Currently, users must examine error tracebacks themselves, then set the Tasks they wish to run again from error to waiting status. This gets tedious and forces users to babysit their Tasks, even though many of these complete successfully on rerun.

Instead of this, we would like to empower users with the ability to set a TaskRestartPolicy on an AlchemicalNetwork, which would encode a list giving:

- regex pattern of the traceback output to match
- max number of retries to perform for matching errors
- other options, such as how strongly to avoid a compute service with the same identifying information as one that previously failed on the Task.
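
To make the first two items concrete, here is a minimal sketch of how such a policy might be evaluated. It is not the alchemiscale API: `RestartPattern`, `RestartPolicy`, and `should_restart` are hypothetical names invented for illustration, and the rule that the first matching pattern decides is an assumption. The compute-service avoidance option is left out.

```python
import re
from dataclasses import dataclass, field


@dataclass
class RestartPattern:
    """Hypothetical policy entry: a traceback regex and its retry budget."""
    pattern: str      # regex searched for in the traceback text
    max_retries: int  # retries allowed for errors matching this pattern


@dataclass
class RestartPolicy:
    """Hypothetical per-network policy: an ordered list of entries."""
    patterns: list[RestartPattern] = field(default_factory=list)

    def should_restart(self, traceback: str, retries_so_far: int) -> bool:
        # Assumption: the first entry whose regex matches decides the outcome.
        for entry in self.patterns:
            if re.search(entry.pattern, traceback):
                return retries_so_far < entry.max_retries
        # No entry matched: leave the Task in error for the user to inspect.
        return False


# Example patterns are illustrative, not taken from real alchemiscale logs.
policy = RestartPolicy(
    patterns=[
        RestartPattern(r"CUDA_ERROR_\w+", max_retries=3),
        RestartPattern(r"ConnectionResetError", max_retries=1),
    ]
)
restart = policy.should_restart("RuntimeError: CUDA_ERROR_LAUNCH_FAILED", 0)
```

A server-side process could apply a check like this to each errored Task and move it back to waiting only when the check passes, so unrecognized errors still surface to the user.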

Related to #258.
Likely requires #109 to be implemented in some form to periodically apply server-side restarts given the policies set.

dotsdl · 2024-06-05T04:12:28Z

Thanks to @JenkeScheen for raising this issue in today's user group meeting!

dotsdl · 2024-06-13T15:13:30Z

@ianmkenney would you be willing to begin work on this as a head start on the next major milestone? This is of high interest to users, so prioritizing it makes sense for us.

dotsdl · 2024-07-12T04:36:05Z

@ianmkenney can you link your design doc here?

ianmkenney · 2024-07-12T16:46:08Z

Here is the link to the design doc.

dotsdl added the priority-high label Jun 5, 2024

dotsdl added this to the Release 0.6.0 - "living networks" and automated strategies enablement milestone Jun 5, 2024

dotsdl added component-user-api component-user-client component-statestore user-story labels Jun 5, 2024

dotsdl added this to alchemiscale : Phase 3 - Folding@Home, new features, optimizations, targeted refactors Jun 13, 2024

dotsdl moved this to Sprint - Available in alchemiscale : Phase 3 - Folding@Home, new features, optimizations, targeted refactors Jun 13, 2024

dotsdl assigned ianmkenney Jun 13, 2024

dotsdl moved this from Sprint - Available to Sprint - In Progress in alchemiscale : Phase 3 - Folding@Home, new features, optimizations, targeted refactors Jun 25, 2024

ianmkenney linked a pull request Jul 16, 2024 that will close this issue

Implement task restart policies #280

Open

dotsdl moved this to Sprint - In Progress in alchemiscale : advancement sprints Sep 16, 2024

dotsdl added this to alchemiscale : advancement sprints Sep 16, 2024

dotsdl modified the milestones: Release 0.7.0 - "living networks" and automated strategies enablement, Release 0.6.0 - result retrieval optimizations, server-side task restart policies Sep 17, 2024
