Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

exit code is always -1 for steps that reference template defined in same yaml #13297

Open
2 of 4 tasks
tooptoop4 opened this issue Jul 2, 2024 · 8 comments
Open
2 of 4 tasks

Comments

@tooptoop4
Copy link
Contributor

tooptoop4 commented Jul 2, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

with the below workflow and retryStrategy you would expect it not to retry since exit code 2 is not in the retryStrategy but it will in fact retry because lastRetry.exitCode is always going to -1

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: steps-
spec:
  entrypoint: hello-hello-hello
  templates:
  - name: hello-hello-hello
    steps:
    - - name: whalesay1
        template: whalesay1
  - name: whalesay1
    container:
      image: ubuntu:latest
      command: ["/bin/bash", "-c"]
      args: ["exit 2"]
templateDefaults:
  retryStrategy:
    limit: 1
    retryPolicy: "Always"
    expression: 'lastRetry.status == "Error" or (lastRetry.status == "Failed" and (asInt(lastRetry.exitCode) in [255,137,143] or (asInt(lastRetry.exitCode) in [-1])))'
    backoff:
      duration: "75"
      factor: 1
      maxDuration: "300"

more context in #13228 (comment)

Version

3.4.11

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: steps-
spec:
  entrypoint: hello-hello-hello
  templates:
  - name: hello-hello-hello
    steps:
    - - name: whalesay1
        template: whalesay1
  - name: whalesay1
    container:
      image: ubuntu:latest
      command: ["/bin/bash", "-c"]
      args: ["exit 2"]

Logs from the workflow controller

see #12572 (comment)

Logs from in your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

n/a
@jswxstw
Copy link
Member

jswxstw commented Jul 2, 2024

I strongly recommend not including retryStrategy in templateDefaults now, because it will be applied to both template hello-hello-hello and whalesay1, which causes a lot of meaningless retries.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: steps-
spec:
  entrypoint: hello-hello-hello
  templateDefaults:
    retryStrategy:
      limit: 1
  templates:
  - name: hello-hello-hello
    steps:
    - - name: whalesay1
        template: whalesay1
  - name: whalesay1
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["exit 2"]
image

retryStrategy should not be applied to template of Steps type, which may introducing other issues like #12009, #12010.

@agilgur5 agilgur5 changed the title exit code is always -1 instead of actual exit code with steps that reference a template defined in the same yaml exit code is always -1 for steps that reference template defined in same yaml Jul 2, 2024
@tooptoop4
Copy link
Contributor Author

wow @jswxstw yours is even worse, 4 pods run, for me even 2 pods run is bad instead of 1! this is P1 issue right?

@jswxstw
Copy link
Member

jswxstw commented Jul 3, 2024

wow @jswxstw yours is even worse, 4 pods run, for me even 2 pods run is bad instead of 1! this is P1 issue right?

Not really, the workflow I showed is equivalent to the following workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: steps-
spec:
  entrypoint: hello-hello-hello    
  templates:
  - name: hello-hello-hello
    retryStrategy:
      limit: 1
    steps:
    - - name: whalesay1
        template: whalesay1
  - name: whalesay1
    retryStrategy:
      limit: 1
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["exit 2"]

Template whalesay1 and hello-hello-hello both retried two times according to the retryStrategy which is also as expected.

@tooptoop4
Copy link
Contributor Author

but exit code 2 is not in the retrystrategy

@jswxstw
Copy link
Member

jswxstw commented Jul 3, 2024

but exit code 2 is not in the retrystrategy

The node of template hello-hello-hello is type of Steps which does not have exit code, and it should not have a retrystrategy set.

@tooptoop4
Copy link
Contributor Author

tooptoop4 commented Jul 3, 2024

i'm thinking few things need to change:

  1. advice not to use https://argo-workflows.readthedocs.io/en/latest/template-defaults/ at global controller level due to this issue
  2. could templateDefaults be automatically not applied to steps but applied to all other nodes? as i really want a blanket rule for all pods run
  3. could Steps be changed to get exit code from underlying pod

for point 2 something like below

func (woc *wfOperationCtx) mergedTemplateDefaultsInto(originalTmpl *wfv1.Template) error {
	if woc.execWf.Spec.TemplateDefaults != nil {
		originalTmplType := originalTmpl.GetType()

		var tmplDefaultsMap map[string]interface{}
		tmplDefaultsJson, err := json.Marshal(woc.execWf.Spec.TemplateDefaults)
		if err != nil {
			return err
		}
		err = json.Unmarshal(tmplDefaultsJson, &tmplDefaultsMap)
		if err != nil {
			return err
		}
		if originalTmplType == wfv1.TemplateTypeSteps {
			delete(tmplDefaultsMap, "retryStrategy")
		}

		tmplDefaultsJson, err = json.Marshal(tmplDefaultsMap)
		if err != nil {
			return err
		}

		targetTmplJson, err := json.Marshal(originalTmpl)
		if err != nil {
			return err
		}

		resultTmpl, err := strategicpatch.StrategicMergePatch(tmplDefaultsJson, targetTmplJson, wfv1.Template{})
		if err != nil {
			return err
		}
		err = json.Unmarshal(resultTmpl, originalTmpl)
		if err != nil {
			return err
		}
		originalTmpl.SetType(originalTmplType)
	}
	return nil
}

instead of

func (woc *wfOperationCtx) mergedTemplateDefaultsInto(originalTmpl *wfv1.Template) error {
?

@jswxstw
Copy link
Member

jswxstw commented Jul 3, 2024

2. could templateDefaults be automatically not applied to steps but applied to all other nodes? as i really want a blanket rule for all pods run

It makes sense for me, but I'm afraid other members may have different opinions.

3. could Steps be changed to get exit code from underlying pod

I'm afraid not. What if two parallel steps both failed?

@Joibel
Copy link
Member

Joibel commented Jul 3, 2024

I don't think we can reasonably change the behavior of templateDefaults as is, it would be a breaking change. A new default for "just pods" would be acceptable, but we'd need some refactoring of the code around this because currently the sources of information when composing up individual nodes is large and this would expand that list again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants