Improve retry policy #1200
Comments
I'd suggest a bit more compact/human-friendly but explicit syntax for YAML (like
and
@peterschmidt85, how would I use the syntax you're suggesting to always retry the job? Do we need other
Personally, I've never had a need to restart a job on an error. We could support
There should be a way to always restart jobs to support running production services. It's in the issue description.
Ahh, makes sense. Then let's support
@peterschmidt85, we're currently using pyyaml for YAML parsing, which supports YAML 1.1 but not YAML 1.2 (see yaml/pyyaml#555). I suggest we choose a different property name to avoid problems with YAML 1.1, such as:
retry:
  on_events: [no-capacity, interruption]
  duration: 1h
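To make the proposed shape concrete, here is a minimal sketch of how an already-loaded `retry` mapping could be validated. The names `RetryConfig`, `parse_retry`, and `parse_duration` are hypothetical and are not part of dstack. The boolean-key check assumes the name being avoided was `on`, which a YAML 1.1 loader such as pyyaml resolves to the boolean `True`.

```python
import re
from dataclasses import dataclass
from typing import Any, Dict, List

# Hypothetical unit table; the source only shows "1h".
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_duration(value: Any) -> int:
    """Convert a value such as '1h' or 90 (seconds) into seconds."""
    if isinstance(value, int) and not isinstance(value, bool):
        return value
    match = re.fullmatch(r"(\d+)([smhd])", str(value).strip())
    if match is None:
        raise ValueError(f"invalid duration: {value!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


@dataclass(frozen=True)
class RetryConfig:
    on_events: List[str]
    duration_seconds: int


def parse_retry(raw: Dict[Any, Any]) -> RetryConfig:
    """Validate the `retry` mapping after the YAML loader has run."""
    # Assumption: a key written as `on` arrives as the boolean True
    # when the document is loaded with a YAML 1.1 parser.
    if any(key is True for key in raw):
        raise ValueError("key 'on' was loaded as a boolean; use 'on_events'")
    events = raw.get("on_events", [])
    if not isinstance(events, list) or not all(isinstance(e, str) for e in events):
        raise ValueError("on_events must be a list of strings")
    if "duration" not in raw:
        raise ValueError("duration is required")
    return RetryConfig(list(events), parse_duration(raw["duration"]))
```

With the example above, `parse_retry({"on_events": ["no-capacity", "interruption"], "duration": "1h"})` yields a config with a 3600-second duration.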
Currently, dstack's retry policy is very limited – it only works for interrupted spot jobs (even though its description says the run is retried on failure). To make the retry policy useful, it should cover common use cases, including the following:
The current retry policy specification looks like this:
We should introduce new values for retry:
- retry: always – always retries the job unless explicitly stopped
- retry: no-capacity – retries on no capacity/interruption but not if the job failed
- retry: never – default
So retry_policy could look like this:
To specify different retry policies via CLI, we could allow specifying them in --retry:
dstack run . --retry=always --retry-duration=1h
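The three proposed values and the CLI flags could be sketched roughly as follows. This is only an illustration: `RetryPolicy`, `TerminationReason`, `should_retry`, and `build_parser` are hypothetical names, the termination reasons are assumed for the example, and the `argparse` parser is a stand-in rather than dstack's actual CLI code.

```python
import argparse
from enum import Enum


class RetryPolicy(str, Enum):
    ALWAYS = "always"
    NO_CAPACITY = "no-capacity"
    NEVER = "never"  # proposed default


class TerminationReason(str, Enum):  # hypothetical set of reasons
    NO_CAPACITY = "no-capacity"
    INTERRUPTION = "interruption"
    FAILED = "failed"
    STOPPED_BY_USER = "stopped-by-user"


def should_retry(policy: RetryPolicy, reason: TerminationReason) -> bool:
    """Decide whether a terminated job is resubmitted under the policy."""
    if reason is TerminationReason.STOPPED_BY_USER:
        return False  # an explicit stop is never retried
    if policy is RetryPolicy.ALWAYS:
        return True
    if policy is RetryPolicy.NO_CAPACITY:
        return reason in (
            TerminationReason.NO_CAPACITY,
            TerminationReason.INTERRUPTION,
        )
    return False  # RetryPolicy.NEVER


def build_parser() -> argparse.ArgumentParser:
    """Stand-in parser exposing the proposed --retry flags."""
    parser = argparse.ArgumentParser(prog="dstack run")
    parser.add_argument("working_dir")
    parser.add_argument(
        "--retry",
        choices=[p.value for p in RetryPolicy],
        default=RetryPolicy.NEVER.value,
    )
    parser.add_argument("--retry-duration", default=None)
    return parser
```

Parsing `[".", "--retry=always", "--retry-duration=1h"]` with this parser gives `retry == "always"` and `retry_duration == "1h"`, and `should_retry(RetryPolicy.ALWAYS, TerminationReason.FAILED)` returns `True`.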
The semantics of duration should also be changed and clarified. Currently, the duration is calculated from the job submission time. It should be calculated from the last failure time (or job submission time for new jobs) so that retry policy can be used to retry production services.
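The proposed change to the duration semantics can be illustrated with a small sketch. The function name `retry_window_open` and its parameters are hypothetical. It assumes that `last_failure_at` is the time of the failure that started the current series of retries and is `None` for a job that has not failed yet.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def retry_window_open(
    now: datetime,
    submitted_at: datetime,
    last_failure_at: Optional[datetime],
    duration: timedelta,
) -> bool:
    """Return True while the job may still be retried.

    The window starts at the last failure time, or at the submission
    time for a new job that has not failed yet.
    """
    window_start = last_failure_at if last_failure_at is not None else submitted_at
    return now - window_start <= duration


# A service submitted two days ago that failed ten minutes ago.
now = datetime(2024, 5, 10, 12, 0, tzinfo=timezone.utc)
submitted = now - timedelta(days=2)
failed = now - timedelta(minutes=10)

# Proposed semantics: the 1h window is measured from the failure.
still_retrying = retry_window_open(now, submitted, failed, timedelta(hours=1))
# Current semantics: the window is measured from submission only.
current_behaviour = retry_window_open(now, submitted, None, timedelta(hours=1))
```

Measured from the failure, the one-hour window is still open for this service; measured from submission, it closed long ago, which is why the current behaviour does not work for long-running production services.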