Workflows that failed before upgrade to 3.5.6 fail to retry #13003
Comments
I tested this purely on 3.5.6, and it fails if you attempt to retry this deliberately broken dag diamond.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dag-diamond-
spec:
  entrypoint: diamond
  templates:
  - name: diamond
    dag:
      tasks:
      - name: A
        template: echo
        arguments:
          parameters: [{name: message, value: A}]
      - name: B
        depends: "A"
        template: echo
        arguments:
          parameters: [{name: message, value: B}]
      - name: C
        depends: "A"
        template: echo
        arguments:
          parameters: [{name: message, value: C}]
      - name: D
        depends: "B && C"
        template: eacho
        arguments:
          parameters: [{name: message, value: D}]
  - name: echo
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [echo, "{{inputs.parameters.message}}"]
  - name: eacho
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [eacho, "{{inputs.parameters.message}}"]

A link to the slack discussion: https://cloud-native.slack.com/archives/C01QW9QSSSK/p1714641906410049
Another piece of info:
For me, the workflow above that reproduces the issue on 3.5.6 doesn't reproduce it on 3.5.5.
Ignore that last comment: in a really basic Workflows installation it doesn't go wrong for me at all, and 3.5.6 retries happily there. I'll try to determine how our production 3.5.6 differs and why it only fails there.
Our production has a

metadata:
  generateName: dag-diamond-
spec:
  entrypoint: diamond
  templates:
  - name: diamond
    retryStrategy:
      limit: 2
      retryPolicy: OnError
    dag:
      tasks:
      - name: A
        template: echo
        arguments:
          parameters: [{name: message, value: A}]
      - name: B
        depends: "A"
        template: echo
        arguments:
          parameters: [{name: message, value: B}]
      - name: C
        depends: "A"
        template: echo
        arguments:
          parameters: [{name: message, value: C}]
      - name: D
        depends: "B && C"
        template: eacho
        arguments:
          parameters: [{name: message, value: D}]
  - name: echo
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [echo, "{{inputs.parameters.message}}"]
  - name: eacho
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [eacho, "{{inputs.parameters.message}}"]

This works correctly with 3.5.5
This was broken by #12817.
I feel like this has got to be related to the root cause I mentioned in #12817 (review), although the PR itself did not touch (automated) retry nodes. The manual retry logic needs a refactor in general. We should also add all these failing cases as tests.
I do think the retry node needs to be skipped when checking whether the descendants include succeeded nodes, since it is virtual.
To clarify, this will happen even when no retry was needed, correct? Or does it only occur if a retry is triggered?
This requires both a retryStrategy and a manual retry attempt, but the retryStrategy does not need to have been used; we just need the retry virtual node to be present. I don't believe the actual retryStrategy matters at all.
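To illustrate the suggestion above about skipping the virtual retry node, here is a minimal sketch in Python. It is a simplified model, not the controller's actual code: the `Node` class, its field names, and `has_succeeded_descendant` are all hypothetical and made up for illustration. The idea is that the walk still passes through a Retry node to reach its children, but never counts the Retry node's own phase as a success.

```python
from dataclasses import dataclass, field


# Hypothetical, simplified node model. The field names are illustrative
# and do not mirror the controller's real node status type.
@dataclass
class Node:
    id: str
    type: str   # e.g. "Pod", "DAG", "Retry"
    phase: str  # e.g. "Succeeded", "Failed"
    children: list[str] = field(default_factory=list)


def has_succeeded_descendant(nodes: dict[str, Node], root_id: str) -> bool:
    """Return True if any non-Retry descendant of root_id has succeeded.

    Retry nodes are treated as virtual: their own phase is ignored, but
    the walk continues through them to the real nodes underneath.
    """
    stack = list(nodes[root_id].children)
    seen: set[str] = set()
    while stack:
        node_id = stack.pop()
        if node_id in seen:
            continue
        seen.add(node_id)
        node = nodes[node_id]
        # Only real (non-virtual) nodes count as a success.
        if node.type != "Retry" and node.phase == "Succeeded":
            return True
        stack.extend(node.children)
    return False
```

With this model, a Retry node marked Succeeded that wraps only failed pods does not count as a succeeded descendant, which is the behaviour the comment above is asking for. Whether the real fix looks like this is an assumption on my part.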
…13003 (#13004) Signed-off-by: shuangkun <[email protected]>
…13003 (#13004) Signed-off-by: shuangkun <[email protected]> (cherry picked from commit 71f1d86)
Pre-requisites
I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
What happened/what did you expect to happen?
We were on 3.5.5
Some workflows failed
Upgraded to 3.5.6
Retried some of the workflows
Result:
Reproducible on any workflow
Version
v3.5.6
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
any workflow reproduces it
Logs from the workflow controller
Logs from your workflow's wait container