As many `Task`s are executed on compute services running on disparate resources, it's likely that random errors will impact some fraction of them, with a small, countable set of failure modes. Currently, users must examine error tracebacks themselves, then set the `Task`s they wish to run again from `error` to `waiting` status. This can get tedious, and requires many users to babysit their `Task`s, even though many of these would complete successfully on rerun.
Instead of this, we would like to empower users with the ability to set a `TaskRestartPolicy` on an `AlchemicalNetwork`, which would encode a list giving:
- regex pattern of the traceback output to match
- max number of retries to perform for matching errors
- other options, such as how strongly to avoid a compute service with the same identifying information as one that previously failed on the `Task`
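To make the proposal concrete, here is a minimal sketch of what such a policy could look like. Only the name `TaskRestartPolicy` comes from this issue; `RestartPattern`, `should_restart`, the field names, and the example regexes are assumptions for illustration, not an existing API. It covers only the first two items (pattern and retry cap) and, for simplicity, assumes a single retry count per `Task` rather than one per pattern.

```python
import re
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RestartPattern:
    """Hypothetical policy entry: a traceback regex plus a retry cap."""
    pattern: str      # regex searched for in the traceback text
    max_retries: int  # restarts allowed for errors matching this pattern


@dataclass
class TaskRestartPolicy:
    """Hypothetical container for restart rules; not the project's actual API."""
    patterns: list[RestartPattern] = field(default_factory=list)

    def should_restart(self, traceback_text: str, retries_so_far: int) -> bool:
        # Restart if any pattern matches and its retry cap has not been reached.
        for entry in self.patterns:
            matched = re.search(entry.pattern, traceback_text) is not None
            if matched and retries_so_far < entry.max_retries:
                return True
        return False


# Example policy with made-up patterns for transient failures.
policy = TaskRestartPolicy(
    patterns=[
        RestartPattern(pattern=r"CUDA_ERROR_\w+", max_retries=3),
        RestartPattern(pattern=r"ConnectionResetError", max_retries=5),
    ]
)
```

A server-side process could call something like `should_restart` on each errored `Task` and move it back to `waiting` when it returns `True`; how that would be scheduled is left open here.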
Related to #258.
Likely requires #109 to be implemented in some form to periodically apply server-side restarts given the policies set.
@ianmkenney would you be willing to begin work on this as a head start on the next major milestone? This is of high interest to users, so prioritizing it makes sense for us.