Commit

docs: describe containerset retrystrategy and verify it works. Fixes: #11502 (#12809)

Signed-off-by: shuangkun <[email protected]>
Signed-off-by: shuangkun tian <[email protected]>
Signed-off-by: Anton Gilgur <[email protected]>
Co-authored-by: Anton Gilgur <[email protected]>
Co-authored-by: Anton Gilgur <[email protected]>
3 people authored Apr 4, 2024
1 parent f521c30 commit 459b09d
Showing 13 changed files with 206 additions and 16 deletions.
5 changes: 3 additions & 2 deletions api/jsonschema/schema.json

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 3 additions & 2 deletions api/openapi-spec/swagger.json

70 changes: 70 additions & 0 deletions docs/container-set-template.md
@@ -116,3 +116,73 @@ Example B: Lopsided requests, e.g. `a -> b` where `a` is cheap and `b` is expens
Can you see the problem here? `a` only has small requests, but the container set will use the total of all requests. So it's as if you're using all that GPU for 10h. This will be expensive.

Solution: do not use container set when you have lopsided requests.

## Inner `retryStrategy` usage

> v3.3 and after

You can set an inner `retryStrategy` to apply to all containers of a container set, including the `duration` between each retry and the total number of `retries`.

See an example below:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: containerset-with-retrystrategy
  annotations:
    workflows.argoproj.io/description: |
      This workflow creates a container set with a retryStrategy.
spec:
  entrypoint: containerset-retrystrategy-example
  templates:
    - name: containerset-retrystrategy-example
      containerSet:
        retryStrategy:
          retries: "10" # if a container fails, retry it at most ten times
          duration: 30s # wait 30s between retries
        containers:
          # this container completes successfully, so it won't be retried.
          - name: success
            image: python:alpine3.6
            command:
              - python
              - -c
            args:
              - |
                print("hi")
          # if this container fails, it is retried at most ten times.
          - name: fail-retry
            image: python:alpine3.6
            command: ["python", -c]
            # fail with a 66% probability
            args: ["import random; import sys; exit_code = random.choice([0, 1, 1]); sys.exit(exit_code)"]
```

<!-- markdownlint-disable MD046 -- allow indentation within the admonition -->

!!! Note "Template-level `retryStrategy` vs Container Set `retryStrategy`"
    `containerSet.retryStrategy` works differently from [template-level retries](retries.md):

    1. Your `command` will be re-run by the Executor inside the same container if it fails.

        - As no new containers are created, the nodes in the UI remain the same, and the retried logs are appended to the original container's logs. For example, your container logs may look like:
          ```text
          time="2024-03-29T06:40:25 UTC" level=info msg="capturing logs" argo=true
          intentional failure
          time="2024-03-29T06:40:25 UTC" level=debug msg="ignore signal child exited" argo=true
          time="2024-03-29T06:40:26 UTC" level=info msg="capturing logs" argo=true
          time="2024-03-29T06:40:26 UTC" level=debug msg="ignore signal urgent I/O condition" argo=true
          intentional failure
          time="2024-03-29T06:40:26 UTC" level=debug msg="ignore signal child exited" argo=true
          time="2024-03-29T06:40:26 UTC" level=debug msg="forwarding signal terminated" argo=true
          time="2024-03-29T06:40:27 UTC" level=info msg="sub-process exited" argo=true error="<nil>"
          time="2024-03-29T06:40:27 UTC" level=info msg="not saving outputs - not main container" argo=true
          Error: exit status 1
          ```

    1. If a container's `command` cannot be located, it will not be retried.

        - As it will fail each time, the retry logic is short-circuited.

<!-- markdownlint-enable MD046 -->
3 changes: 3 additions & 0 deletions docs/executor_swagger.md
@@ -1026,10 +1026,13 @@ referred to by services.
### <span id="container-set-retry-strategy"></span> ContainerSetRetryStrategy


> ContainerSetRetryStrategy provides controls on how to retry a container set





**Properties**

| Name | Type | Go type | Required | Default | Description | Example |
6 changes: 3 additions & 3 deletions docs/fields.md
@@ -2411,7 +2411,7 @@ _No description available_
| Field Name | Field Type | Description |
|:----------:|:----------:|---------------|
|`containers`|`Array<`[`ContainerNode`](#containernode)`>`|_No description available_|
-|`retryStrategy`|[`ContainerSetRetryStrategy`](#containersetretrystrategy)|RetryStrategy describes how to retry a container nodes in the container set if it fails. Nbr of retries(default 0) and sleep duration between retries(default 0s, instant retry) can be set.|
+|`retryStrategy`|[`ContainerSetRetryStrategy`](#containersetretrystrategy)|RetryStrategy describes how to retry container nodes if the container set fails. Note that this works differently from the template-level `retryStrategy` as it is a process-level retry that does not create new Pods or containers.|
|`volumeMounts`|`Array<`[`VolumeMount`](#volumemount)`>`|_No description available_|

## DAGTemplate
@@ -3748,7 +3748,7 @@ _No description available_

## ContainerSetRetryStrategy

-_No description available_
+ContainerSetRetryStrategy provides controls on how to retry a container set

<details markdown>
<summary>Examples with this field (click to open)</summary>
@@ -3780,7 +3780,7 @@ _No description available_
| Field Name | Field Type | Description |
|:----------:|:----------:|---------------|
|`duration`|`string`|Duration is the time between each retry, examples values are "300ms", "1s" or "5m". Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h".|
-|`retries`|[`IntOrString`](#intorstring)|Nbr of retries|
+|`retries`|[`IntOrString`](#intorstring)|Retries is the maximum number of retry attempts for each container. It does not include the first, original attempt; the maximum number of total attempts will be `retries + 1`.|

## DAGTask

8 changes: 5 additions & 3 deletions pkg/apis/workflow/v1alpha1/container_set_template_types.go
@@ -12,16 +12,18 @@ import (
type ContainerSetTemplate struct {
	Containers   []ContainerNode      `json:"containers" protobuf:"bytes,4,rep,name=containers"`
	VolumeMounts []corev1.VolumeMount `json:"volumeMounts,omitempty" protobuf:"bytes,3,rep,name=volumeMounts"`
-	// RetryStrategy describes how to retry a container nodes in the container set if it fails.
-	// Nbr of retries(default 0) and sleep duration between retries(default 0s, instant retry) can be set.
+	// RetryStrategy describes how to retry container nodes if the container set fails.
+	// Note that this works differently from the template-level `retryStrategy` as it is a process-level retry that does not create new Pods or containers.
	RetryStrategy *ContainerSetRetryStrategy `json:"retryStrategy,omitempty" protobuf:"bytes,5,opt,name=retryStrategy"`
}

+// ContainerSetRetryStrategy provides controls on how to retry a container set
type ContainerSetRetryStrategy struct {
	// Duration is the time between each retry, examples values are "300ms", "1s" or "5m".
	// Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h".
	Duration string `json:"duration,omitempty" protobuf:"bytes,1,opt,name=duration"`
-	// Nbr of retries
+	// Retries is the maximum number of retry attempts for each container. It does not include the
+	// first, original attempt; the maximum number of total attempts will be `retries + 1`.
	Retries *intstr.IntOrString `json:"retries" protobuf:"bytes,2,rep,name=retries"`
}

8 changes: 5 additions & 3 deletions pkg/apis/workflow/v1alpha1/generated.proto

7 changes: 4 additions & 3 deletions pkg/apis/workflow/v1alpha1/openapi_generated.go

2 changes: 2 additions & 0 deletions pkg/plugins/executor/swagger.yml
@@ -1002,6 +1002,8 @@ definitions:
     title: ContainerPort represents a network port in a single container.
     type: object
   ContainerSetRetryStrategy:
+    description: ContainerSetRetryStrategy provides controls on how to retry a container
+      set
     properties:
       duration:
         description: |-

74 changes: 74 additions & 0 deletions test/e2e/retry_test.go
@@ -4,6 +4,9 @@
package e2e

import (
+	"context"
+	"io"
+	"strings"
	"testing"
	"time"

@@ -120,6 +123,77 @@ spec:
})
}

func (s *RetryTestSuite) TestWorkflowTemplateWithRetryStrategyInContainerSet() {
	var name string
	var ns string
	s.Given().
		WorkflowTemplate("@testdata/workflow-template-with-containerset.yaml").
		Workflow(`
metadata:
  name: workflow-template-containerset
spec:
  workflowTemplateRef:
    name: containerset-with-retrystrategy
`).
		When().
		CreateWorkflowTemplates().
		SubmitWorkflow().
		WaitForWorkflow(fixtures.ToBeFailed).
		Then().
		ExpectWorkflow(func(t *testing.T, metadata *metav1.ObjectMeta, status *wfv1.WorkflowStatus) {
			// testify takes the expected value first, then the actual value.
			assert.Equal(t, wfv1.WorkflowFailed, status.Phase)
		}).
		ExpectWorkflowNode(func(status v1alpha1.NodeStatus) bool {
			return status.Name == "workflow-template-containerset"
		}, func(t *testing.T, status *v1alpha1.NodeStatus, pod *apiv1.Pod) {
			// Remember the Pod so that the subtests below can read its container logs.
			name = pod.GetName()
			ns = pod.GetNamespace()
		})
	// c1 succeeds on the first attempt, so it is not retried.
	s.Run("ContainerLogs", func() {
		ctx := context.Background()
		podLogOptions := &apiv1.PodLogOptions{Container: "c1"}
		stream, err := s.KubeClient.CoreV1().Pods(ns).GetLogs(name, podLogOptions).Stream(ctx)
		assert.Nil(s.T(), err)
		defer stream.Close()
		logBytes, err := io.ReadAll(stream)
		assert.Nil(s.T(), err)
		output := string(logBytes)
		count := strings.Count(output, "capturing logs")
		assert.Equal(s.T(), 1, count)
		assert.Contains(s.T(), output, "hi")
	})
	// c2's command cannot be located, so the retry logic is never entered.
	s.Run("ContainerLogs", func() {
		ctx := context.Background()
		podLogOptions := &apiv1.PodLogOptions{Container: "c2"}
		stream, err := s.KubeClient.CoreV1().Pods(ns).GetLogs(name, podLogOptions).Stream(ctx)
		assert.Nil(s.T(), err)
		defer stream.Close()
		logBytes, err := io.ReadAll(stream)
		assert.Nil(s.T(), err)
		output := string(logBytes)
		count := strings.Count(output, "capturing logs")
		assert.Equal(s.T(), 0, count)
		assert.Contains(s.T(), output, "executable file not found in $PATH")
	})
	// c3 always fails, so its command is re-run inside the same container.
	s.Run("ContainerLogs", func() {
		ctx := context.Background()
		podLogOptions := &apiv1.PodLogOptions{Container: "c3"}
		stream, err := s.KubeClient.CoreV1().Pods(ns).GetLogs(name, podLogOptions).Stream(ctx)
		assert.Nil(s.T(), err)
		defer stream.Close()
		logBytes, err := io.ReadAll(stream)
		assert.Nil(s.T(), err)
		output := string(logBytes)
		count := strings.Count(output, "capturing logs")
		assert.Equal(s.T(), 2, count)
		countFailureInfo := strings.Count(output, "intentional failure")
		assert.Equal(s.T(), 2, countFailureInfo)
	})
}

func TestRetrySuite(t *testing.T) {
suite.Run(t, new(RetryTestSuite))
}
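
The three subtests infer the number of attempts from the container logs by counting marker strings. A minimal standalone sketch of that counting step is shown here; `attemptSummary`, `summarize`, and `sampleLogs` are hypothetical names introduced for illustration, and the sample text is an excerpt of the log shown in `docs/container-set-template.md`.

```go
package main

import (
	"fmt"
	"strings"
)

// attemptSummary holds the two counts the e2e test asserts on.
type attemptSummary struct {
	Starts   int // occurrences of the "capturing logs" line, one per started attempt
	Failures int // occurrences of the command's own "intentional failure" output
}

// summarize counts the marker strings in a container's log output.
func summarize(logs string) attemptSummary {
	return attemptSummary{
		Starts:   strings.Count(logs, "capturing logs"),
		Failures: strings.Count(logs, "intentional failure"),
	}
}

// sampleLogs is an excerpt of the example log from the documentation.
const sampleLogs = `time="2024-03-29T06:40:25 UTC" level=info msg="capturing logs" argo=true
intentional failure
time="2024-03-29T06:40:25 UTC" level=debug msg="ignore signal child exited" argo=true
time="2024-03-29T06:40:26 UTC" level=info msg="capturing logs" argo=true
time="2024-03-29T06:40:26 UTC" level=debug msg="ignore signal urgent I/O condition" argo=true
intentional failure
time="2024-03-29T06:40:26 UTC" level=debug msg="ignore signal child exited" argo=true
`

func main() {
	fmt.Printf("%+v\n", summarize(sampleLogs))
}
```

Counting substrings is a coarse check: it depends on the exact wording of the log lines, so a change to those messages would require updating the markers.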
32 changes: 32 additions & 0 deletions test/e2e/testdata/workflow-template-with-containerset.yaml
@@ -0,0 +1,32 @@
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: containerset-with-retrystrategy
  annotations:
    workflows.argoproj.io/description: |
      This workflow creates a container set with a retryStrategy.
spec:
  entrypoint: test
  templates:
    - name: test
      containerSet:
        retryStrategy:
          retries: "2"
        containers:
          - name: c1
            image: python:alpine3.6
            command:
              - python
              - -c
            args:
              - |
                print("hi")
          - name: c2
            image: python:alpine3.6
            command:
              - invalid
              - command
          - name: c3
            image: alpine:latest
            command: [ sh, -c ]
            args: [ "echo intentional failure; exit 1" ]
