Improve retry policy #1200

r4victor · 2024-05-07T06:53:42Z

Currently, dstack's retry policy is very limited – it only works for interrupted spot jobs (despite its description that the run is retried on failure). To make retry policy useful, it should cover common use cases, including the following:

When I run a production service, I want to always restart the job if it fails for any reason (possibly with some large duration limit).
When I run a one-time task, I want to retry provisioning only while waiting for capacity. I don't want the job to be restarted if there is a problem with my code.

The current retry policy specification looks like this:

retry_policy:
  retry: true
  duration: 1h

We should introduce new values for retry:

retry: always – always retries the job unless explicitly stopped
retry: no-capacity – retries on no capacity/interruption but not if the job failed
retry: never – default

So retry_policy could look like this:

retry_policy:
  retry: always
  duration: 1h
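
To make the proposed values concrete, here is a minimal sketch of the decision they imply. The enum names, the termination reasons, and the `should_retry` helper are all hypothetical and only illustrate the semantics described above; they are not dstack's actual implementation.

```python
from enum import Enum


class RetryMode(str, Enum):
    # The three values proposed for `retry` in this issue.
    ALWAYS = "always"
    NO_CAPACITY = "no-capacity"
    NEVER = "never"


class StopReason(str, Enum):
    # Hypothetical termination reasons, named for illustration only.
    NO_CAPACITY = "no-capacity"
    INTERRUPTION = "interruption"
    JOB_FAILED = "job-failed"
    STOPPED_BY_USER = "stopped-by-user"


def should_retry(mode: RetryMode, reason: StopReason) -> bool:
    """Return True if a job that ended for `reason` should be resubmitted."""
    # An explicit stop is never retried, even with `retry: always`.
    if reason is StopReason.STOPPED_BY_USER:
        return False
    if mode is RetryMode.ALWAYS:
        return True
    if mode is RetryMode.NO_CAPACITY:
        # Retry on no capacity or spot interruption, but not on job failure.
        return reason in (StopReason.NO_CAPACITY, StopReason.INTERRUPTION)
    # RetryMode.NEVER is the default: no retries at all.
    return False
```

With this sketch, a production service would use `always`, while a one-time task would use `no-capacity` so that a bug in user code does not trigger a restart.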

To specify different retry policies via CLI, we could allow specifying them in --retry:

dstack run . --retry=always --retry-duration=1h

The semantics of duration should also be changed and clarified. Currently, the duration is calculated from the job submission time. It should be calculated from the last failure time (or job submission time for new jobs) so that retry policy can be used to retry production services.
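
The difference between the two ways of counting duration can be sketched as follows. The function and parameter names are hypothetical; the point is only to show why a window anchored at submission time is useless for a long-running service.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def retry_deadline(
    submitted_at: datetime,
    last_failed_at: Optional[datetime],
    duration: timedelta,
) -> datetime:
    """Proposed semantics: the window starts at the last failure,
    or at submission time for a job that has not failed yet."""
    start = last_failed_at if last_failed_at is not None else submitted_at
    return start + duration


def can_retry_current(now: datetime, submitted_at: datetime, duration: timedelta) -> bool:
    """Current semantics: the window always starts at submission time."""
    return now <= submitted_at + duration


def can_retry_proposed(
    now: datetime,
    submitted_at: datetime,
    last_failed_at: Optional[datetime],
    duration: timedelta,
) -> bool:
    return now <= retry_deadline(submitted_at, last_failed_at, duration)


# A service submitted three days ago fails; 30 minutes later we decide whether to retry.
submitted = datetime(2024, 5, 7, 0, 0, tzinfo=timezone.utc)
failed = submitted + timedelta(hours=72)
now = failed + timedelta(minutes=30)
one_hour = timedelta(hours=1)

current_result = can_retry_current(now, submitted, one_hour)            # window expired long ago
proposed_result = can_retry_proposed(now, submitted, failed, one_hour)  # still inside the window
```

Under the current semantics the 1h window has long expired by the time the service fails, so it is never restarted; under the proposed semantics the window restarts at each failure.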


peterschmidt85 · 2024-05-17T10:20:38Z

I'd suggest a more compact, human-friendly, but still explicit syntax for YAML (like on GitHub):

retry:
  on: [no-capacity, interruption]
  duration: 1h

and

retry:
  on: no-capacity
  duration: 1h

r4victor · 2024-05-22T12:01:55Z

@peterschmidt85, how would I use the syntax you're suggesting to always retry the job? Do we need other on values besides [no-capacity, interruption] like error?

peterschmidt85 · 2024-05-22T12:14:03Z

> @peterschmidt85, how would I use the syntax you're suggesting to always retry the job? Do we need other on values besides [no-capacity, interruption] like error?

Personally, I've never needed to restart a job on an error. We could support error later if there is such a need.

r4victor · 2024-05-22T12:25:28Z

There should be a way to always restart jobs to support running production services. It's in the issue description.

peterschmidt85 · 2024-05-22T12:28:05Z

> There should be a way to always restart jobs to support running production services. It's in the issue description.

Ahh, makes sense. Then let's support error too.

r4victor · 2024-05-28T10:27:45Z

@peterschmidt85, we're currently using pyyaml for YAML parsing, and it supports YAML 1.1 but not YAML 1.2 (see yaml/pyyaml#555).

In YAML 1.1, an unquoted on is interpreted as a boolean, so it needs to be quoted when used as a property name.

I suggest we choose a different property name, such as on_events, to avoid problems with YAML 1.1:

retry:
  on_events: [no-capacity, interruption]
  duration: 1h
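
The YAML 1.1 behavior described above can be reproduced with PyYAML directly. This is a small demonstration using `yaml.safe_load`; the config strings are the snippets from this thread, and the variable names are only for illustration.

```python
import yaml  # PyYAML, which resolves scalars according to YAML 1.1

# Unquoted `on` is resolved as the boolean True, so the key is not the string "on".
unquoted = yaml.safe_load(
    "retry:\n"
    "  on: [no-capacity, interruption]\n"
    "  duration: 1h\n"
)

# Quoting the key keeps it a string, but users would have to remember to do so.
quoted = yaml.safe_load(
    "retry:\n"
    '  "on": [no-capacity, interruption]\n'
    "  duration: 1h\n"
)

# A different property name avoids the problem entirely.
renamed = yaml.safe_load(
    "retry:\n"
    "  on_events: [no-capacity, interruption]\n"
    "  duration: 1h\n"
)

print(unquoted["retry"])  # the key is the boolean True, not "on"
print(quoted["retry"])
print(renamed["retry"])
```

In the unquoted case the events list ends up under the key `True`, so a lookup of `"on"` would fail unless the parser special-cases it.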

r4victor added the feature label May 7, 2024

This was referenced May 7, 2024

Rename some of the YAML properties to shorter names #1154

Open

[Roadmap] Q2 2024 #1116

Closed

r4victor self-assigned this May 28, 2024

r4victor mentioned this issue May 29, 2024

Implement new improved retry logic #1282

Merged

r4victor closed this as completed in #1282 May 29, 2024

r4victor mentioned this issue May 29, 2024

[Bug]: Instances that are temporarily not available should be skipped during provisioning a job #1234

Closed
