fix: make "Pod was rejected:" errors transient #13842

tooptoop4 · 2024-10-30T20:34:10Z

tczhao · 2024-12-05T13:00:31Z

This error happens when:

- a node in your Kubernetes cluster runs out of disk space
- timing or logic differences between the scheduler and the kubelet, or resource usage by other pods on the same node, leave a pod scheduled but unable to start

I don't believe this is a transient pattern, as the cluster node could be struggling for a real reason.
Many other error messages could match the "Pod was rejected" pattern, and they should be handled properly at the kubelet end.
Argo Workflows users can set `retryPolicy: Always` in a `retryStrategy` to retry such pods.
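
To make the concern about broad matching concrete, here is a minimal sketch in Python (not the Go code touched by this PR). The names `POD_REJECTED`, `is_transient`, and `classify` are hypothetical, and the sample messages are illustrative strings, not output captured from a real cluster. It only shows that a single prefix match treats every rejection reason the same way.

```python
import re

# Hypothetical pattern for this sketch; not taken from the PR's diff.
POD_REJECTED = re.compile(r"Pod was rejected:")

# Illustrative sample messages (assumptions, not real cluster output).
SAMPLE_MESSAGES = [
    "Pod was rejected: The node had condition: [DiskPressure].",
    "Pod was rejected: Node didn't have enough resource: cpu",
    "Pod was rejected: Predicate NodeAffinity failed",
    "OOMKilled",
]


def is_transient(message: str) -> bool:
    """Return True if the message contains the 'Pod was rejected:' marker."""
    return POD_REJECTED.search(message) is not None


def classify(messages):
    """Map each message to whether the pattern would mark it as transient."""
    return {m: is_transient(m) for m in messages}


if __name__ == "__main__":
    for msg, transient in classify(SAMPLE_MESSAGES).items():
        label = "transient" if transient else "not transient"
        print(f"{label}: {msg}")
```

In this sketch, disk pressure, insufficient CPU, and a failed affinity predicate all land in the same "transient" bucket, even though they may call for different handling.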

tooptoop4 · 2024-12-06T23:34:25Z

I don't want to retry on every type of error.

tooptoop4 added 2 commits October 31, 2024 07:32

Signed-off-by: tooptoop4 <[email protected]>

f5923ec

Signed-off-by: tooptoop4 <[email protected]>

f8400ff

shuangkun added the area/retryStrategy Template-level retryStrategy label Nov 1, 2024
