Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve retry policy #1200

Closed
r4victor opened this issue May 7, 2024 · 6 comments · Fixed by #1282
Closed

Improve retry policy #1200

r4victor opened this issue May 7, 2024 · 6 comments · Fixed by #1282
Assignees
Labels

Comments

@r4victor
Copy link
Collaborator

r4victor commented May 7, 2024

Currently, dstack's retry policy is very limited – it only works for interrupted spot jobs (despite its description that the run is retried on failure). To make retry policy useful, it should cover common use cases, including the following:

  1. When I run a production service, I want to always restart the job if it fails for any reason (possibly with some large duration limit).
  2. When I run a one-time task, I want to retry provisioning only to wait for capacity. I don't want job being restarted if there is a problem with my code.

The current retry policy specification looks like this:

retry_policy:
  retry: true
  duration: 1h

We should introduce new values for retry:

  • retry: always – always retries the job unless explicitly stopped
  • retry: no-capacity – retries on no capacity/interruption but not if the job failed
  • retry: never – default

So retry_policy could look like this:

retry_policy:
  retry: always
  duration: 1h

To specify different retry policies via CLI, we could allow specifying them in --retry:

dstack run . --retry=always --retry-duration=1h

The semantics of duration should also be changed and clarified. Currently, the duration is calculated from the job submission time. It should be calculated from the last failure time (or job submission time for new jobs) so that retry policy can be used to retry production services.

@peterschmidt85
Copy link
Contributor

I'd suggest a bit more compact/human-friendly but explicit syntax for YAML (like on n GitHub):

retry:
  on: [no-capacity, interruption]
  duration: 1h

and

retry:
  on: no-capacity
  duration: 1h

@r4victor
Copy link
Collaborator Author

@peterschmidt85, how would I use the syntax you're suggesting to always retry the job? Do we need other on values besides [no-capacity, interruption] like error?

@peterschmidt85
Copy link
Contributor

@peterschmidt85, how would I use the syntax you're suggesting to always retry the job? Do we need other on values besides [no-capacity, interruption] like error?

I personally, never had a need to restart a job on an error. We could support error later if there is such a need.

@r4victor
Copy link
Collaborator Author

There should be a way to always restart jobs to support running production services. It's in the issue description.

@peterschmidt85
Copy link
Contributor

There should be a way to always restart jobs to support running production services. It's in the issue description.

Ahh, makes sense. Then let's support error too.

@r4victor r4victor self-assigned this May 28, 2024
@r4victor
Copy link
Collaborator Author

@peterschmidt85, we're currently using pyyaml for YAML parsing that supports YAML 1.1 but not YAML 1.2 (see yaml/pyyaml#555)

In YAML 1.1, on is interpreted as boolean, so it needs to be quoted when using it as a property name.

I suggest we choose a different property name to avoid problems with YAML 1.1 such as on_events:

retry:
  on_events: [no-capacity, interruption]
  duration: 1h

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants