fix: make "Pod was rejected:" errors transient #13842

Open
wants to merge 2 commits into
base: main

tooptoop4
Contributor

Fixes #12572

@shuangkun shuangkun added the area/retryStrategy Template-level retryStrategy label Nov 1, 2024
@tczhao
Member

tczhao commented Dec 5, 2024

This error happens when

  • a node in your Kubernetes cluster runs out of disk space
  • timing or logic differences between the scheduler and the kubelet, or resource usage by other pods on the same node, cause a pod to be scheduled but then fail to start

I don't believe this is a transient pattern, as the cluster node could be struggling for a real reason.
Many other error messages could match the "Pod was rejected" pattern, and they should be handled properly at the kubelet end.
Argo Workflows users can set retryPolicy: Always in the template's retryStrategy to retry such pods.
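
To illustrate the concern about over-matching, here is a minimal Go sketch, not the code from this PR. The names `broadPattern`, `narrowPattern`, and `isTransient` are hypothetical, and the second sample message is made up for illustration; only the DiskPressure message comes from the linked issue.

```go
package main

import (
	"fmt"
	"regexp"
)

// broadPattern (hypothetical) treats every message that starts with
// "Pod was rejected:" as transient.
var broadPattern = regexp.MustCompile(`^Pod was rejected:`)

// narrowPattern (hypothetical) only matches rejections whose node
// condition list contains DiskPressure.
var narrowPattern = regexp.MustCompile(
	`^Pod was rejected: The node had condition: \[[^\]]*DiskPressure[^\]]*\]`)

// isTransient reports whether msg matches the given pattern.
func isTransient(pattern *regexp.Regexp, msg string) bool {
	return pattern.MatchString(msg)
}

func main() {
	messages := []string{
		// Message taken from the linked issue title.
		"Pod was rejected: The node had condition: [DiskPressure].",
		// Made-up message standing in for any other rejection reason.
		"Pod was rejected: hypothetical unrelated admission failure",
	}
	for _, m := range messages {
		fmt.Printf("broad=%t narrow=%t  %s\n",
			isTransient(broadPattern, m), isTransient(narrowPattern, m), m)
	}
}
```

The broad prefix match classifies both messages as transient, while the narrower expression only accepts the DiskPressure case. Assuming the controller's `TRANSIENT_ERROR_PATTERN` environment variable behaves as documented, a narrow expression like this could be supplied there and combined with `retryPolicy: OnTransientError`, without changing the built-in list of transient errors.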

@tooptoop4
Contributor Author

I don't want to retry on every type of error.

Successfully merging this pull request may close these issues.

Transient error? Pod was rejected: The node had condition: [DiskPressure].
3 participants