Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(operator): allow retries to consider exit code from init container and don't consider node as pending if init failed. Fixes #11354/#10717/#10045 #13858

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

tooptoop4
Copy link
Contributor

@tooptoop4 tooptoop4 commented Nov 3, 2024

Fixes #11354 and #10717 and #10045

Before this fix it would always go into pending because main container was waiting state (

woc.markNodePhase(ctrNodeName, wfv1.NodePending)
) even though init container already terminated with non-0 exit

This supersedes #13852

cc @terrytangyuan

@tooptoop4 tooptoop4 changed the title fix(operator): allow retries to consider exit code from init container and don't consider node as pending if init failed. Fixes #11354 fix(operator): allow retries to consider exit code from init container and don't consider node as pending if init failed. Fixes #11354/#10717/#10045 Nov 3, 2024
@jswxstw
Copy link
Member

jswxstw commented Nov 4, 2024

Before this fix it would always go into pending because main container was waiting state even though init container already terminated with non-0 exit

This will only be encountered when using ContainerSet, right?

@tooptoop4
Copy link
Contributor Author

Before this fix it would always go into pending because main container was waiting state even though init container already terminated with non-0 exit

This will only be encountered when using ContainerSet, right?

no, see the logs/comments in the linked issues

@toralf
Copy link

toralf commented Dec 5, 2024

We do regularly observe this issue here at work too for 3.5.12

@epifanov6
Copy link

epifanov6 commented Dec 5, 2024

We face this issue regularly, adding retry might very help.
We use 3.6.2.

@jswxstw
Copy link
Member

jswxstw commented Dec 6, 2024

Before this fix it would always go into pending because main container was waiting state even though init container already terminated with non-0 exit

@tooptoop4 I think your PR does not truly resolve this issue, because the nodes in the containerSet are initialized to Pending by default. Could you please add some tests?

_ = woc.initializeNode(ctxNodeName, wfv1.NodeTypeContainer, templateScope, orgTmpl, node.ID, wfv1.NodePending, opts.nodeFlag)

@tooptoop4
Copy link
Contributor Author

@jswxstw this has nothing to do with containerset, i don't run that workflow type

@jswxstw
Copy link
Member

jswxstw commented Dec 6, 2024

@jswxstw this has nothing to do with containerset, i don't run that workflow type

But the line of code you posted here is only relevant to the issue with containerSet.

woc.markNodePhase(ctrNodeName, wfv1.NodePending)

Also, if you are not using containerSet, the node will be Failed if init container fails.
Could you provide the workflow you used if this issue does exist?

@tooptoop4
Copy link
Contributor Author

that block is not limited to containerSet

@jswxstw
Copy link
Member

jswxstw commented Dec 6, 2024

that block is not limited to containerSet

Please provide the workflow you tested to prove it.

@jswxstw
Copy link
Member

jswxstw commented Dec 6, 2024

@toralf @epifanov6 Could you please clarify whether the issue you encountered is related to the missing exit code of the init container, or is it that the node is Pending when the init container fails?

@tooptoop4
Copy link
Contributor Author

see linked issues for all the details

@epifanov6
Copy link

@jswxstw Init container is in state "Terminated", because of the reason "Error".
Then main container is in state "Waiting".
I believe that retry of the init container could help it this situation.

@jswxstw
Copy link
Member

jswxstw commented Dec 10, 2024

@jswxstw Init container is in state "Terminated", because of the reason "Error". Then main container is in state "Waiting". I believe that retry of the init container could help it this situation.

@epifanov6 The node should be Failed if init container fails, which by setting retryStrategy, you can achieve automatic retries.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/controller Controller issues, panics
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pod failed: Error (exit code 255) but no retry
4 participants