Workflow / event pods block scale down #9481
This looks like it is more related to a GKE issue.
+1 to @sarabala1979. Does not seem like an Argo issue. Please re-open if you have more info.
Hello, no one answered his question though. GKE needs pods to be managed by something (a Deployment, StatefulSet, etc.), or else to carry the safe-to-evict annotation, in order to scale down the number of nodes in the cluster. Is this Argo "orchestrator" pod safe to evict? (I don't know the implementation details, but if it retries until success without side effects, I'd consider it "safe" enough to be evicted on scale-downs.)
The Controller is designed to be resilient to restarts as it stores all state in its managed CRs. Intermediate state, such as Pod changes, may be missed during downtime however. See also the "High Availability" documentation.
The Controller is also backed by a Deployment currently. This issue is asking about individual Workflow Pods though, for which it entirely depends on how you designed your tasks -- Argo cannot answer that for you as it is in user-land.
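For illustration only: if you do conclude that your task pods are safe to evict, one possible approach (a sketch, assuming an Argo Workflows version that supports `spec.podMetadata`) is to stamp the GKE cluster-autoscaler annotation onto every pod a Workflow creates:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: evictable-example-
spec:
  entrypoint: main
  # podMetadata is applied to every pod created by this Workflow.
  # The annotation tells the GKE cluster autoscaler that evicting
  # these pods during node scale-down is acceptable.
  podMetadata:
    annotations:
      cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
  templates:
    - name: main
      container:
        image: alpine:3.18
        command: [sh, -c, "echo hello"]
```

Only consider this for tasks that are idempotent or covered by a `retryStrategy` (or resubmission); an evicted pod will otherwise fail its step.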
Checklist
Versions
Summary
After a large event kicked off hundreds of workflows several weeks ago, our prod cluster has not been able to scale back down. Both workflow and event pods block the cluster scale down in GKE due to "Pod is blocking scale down because it has local storage" and "Pod is blocking scale down because it's not backed by a controller".
What is the best practice here? Both of these warnings suggest adding a safe-to-evict annotation - is this safe to add?
Worth noting that both CPU and memory utilisation are low.
Additionally, we've implemented pod disruption budgets to reduce the chance of voluntary disruption of workflow pods. In the meantime, we are investigating internally whether this could be one factor blocking the scale down after a surge.
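For context, a minimal sketch of such a PodDisruptionBudget, assuming workflow pods carry an illustrative `app: workflow-task` label (e.g. applied via `spec.podMetadata.labels`):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: workflow-task-pdb
spec:
  # Allow at most one matching pod to be voluntarily disrupted at a time.
  maxUnavailable: 1
  selector:
    matchLabels:
      app: workflow-task  # illustrative label set on workflow pods via podMetadata
```

Note that a PDB only limits voluntary disruptions, and it is one more thing that can prevent the autoscaler from draining a node, so it may itself work against scale-down.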
Diagnostics
This can be reproduced by kicking off 100+ workflows that sleep for 1000+ seconds.
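A rough reproduction sketch (the manifest, image, and submission loop below are illustrative):

```yaml
# sleep-repro.yaml -- submit ~100 copies, e.g.:
#   for i in $(seq 1 100); do argo submit sleep-repro.yaml; done
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: sleep-repro-
spec:
  entrypoint: sleep
  templates:
    - name: sleep
      container:
        image: alpine:3.18
        # Long-running step that keeps the pod alive while the
        # cluster autoscaler tries to scale nodes back down.
        command: [sh, -c, "sleep 1200"]
```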
We see ~13k logs/hour for the local storage scale-down warning, and ~100 logs/hour for the "not backed by a controller" scale-down warning.
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.