Workflow / event pods block scale down #9481

Closed
3 tasks done
elizabethking2 opened this issue Aug 31, 2022 · 4 comments
Labels
area/controller Controller issues, panics type/support User support issue - likely not a bug

Comments

elizabethking2 commented Aug 31, 2022

Checklist

  • Double-checked my configuration.
  • Tested using the latest version.
  • Used the Emissary executor.

Versions

  • Argo workflows version: v3.3.6
  • Argo events version: 1.12.0
  • K8s GKE version: v1.21.1

Summary

After a large event kicked off hundreds of workflows several weeks ago, our prod cluster has not been able to scale back down. Both workflow and event pods block the cluster scale-down. GKE reports two reasons, "Pod is blocking scale down because it has local storage" and "Pod is blocking scale down because it's not backed by a controller":
Screenshot 2022-08-31 at 11 28 11

What is the best practice here? Both of these warnings suggest adding a safe-to-evict annotation. Is this safe to add?

Worth noting that both CPU and memory utilisation are low:
Screenshot 2022-08-31 at 11 32 15

Additionally, we've implemented pod disruption budgets to reduce the chance of voluntary disruption of workflow pods. In the meantime, we are investigating internally whether this could be one factor blocking the scale-down after a surge.
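
For illustration only: the issue does not show how the budgets were configured. One possibility, assuming Argo's workflow-level `podDisruptionBudget` field was used, is sketched below. The workflow name, image, and `minAvailable` value are hypothetical. A budget that never allows a disruption also prevents the cluster autoscaler from evicting these pods, so the nodes they run on cannot be removed while the workflow is running.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: pdb-example-   # hypothetical name
spec:
  entrypoint: main
  # Argo creates a PodDisruptionBudget covering this workflow's pods.
  # A minAvailable far above the real pod count means no voluntary
  # eviction is ever allowed, including autoscaler node drains.
  podDisruptionBudget:
    minAvailable: 9999         # assumed value, not taken from the issue
  templates:
    - name: main
      container:
        image: alpine:3.16     # hypothetical image
        command: [sleep, "1000"]
```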

Diagnostics

This can be reproduced by kicking off 100+ workflows that each sleep for 1000+ seconds.
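
A minimal sketch of such a workflow, assuming a plain container that sleeps; the name and image are hypothetical and not taken from the reporter's setup. Submitting it 100 or more times (for example with `argo submit`) should approximate the surge described above.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: sleep-repro-   # hypothetical name
spec:
  entrypoint: sleep
  templates:
    - name: sleep
      container:
        image: alpine:3.16     # hypothetical image
        # Keep the pod alive long enough to observe autoscaler behaviour.
        command: [sleep, "1000"]
```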

We see ~13k logs / hour on local storage scale down issues:
Screenshot 2022-08-31 at 11 23 45

And ~100 logs / hour on the no controller scale down issue:
Screenshot 2022-08-31 at 11 58 41


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@sarabala1979
Member

This looks more like a GKE issue.

@alexec
Contributor

alexec commented Sep 5, 2022

+1 to @sarabala1979. Does not seem like an Argo issue. Please re-open if you have more info.

@alexec alexec closed this as completed Sep 5, 2022
@dis-sid
dis-sid commented Oct 21, 2024

Hello, no one answered the question, though. GKE needs pods to be managed by something (deployment, statefulset, etc.) or else to carry the safe-to-evict annotation before it can scale down the number of nodes in the cluster. Is this argo "orchestrator" pod safe to evict? (I don't know the implementation details, but if it retries until success without side effects, I'd consider it "safe" enough to be evicted on scale-downs.)

@agilgur5 agilgur5 added area/controller Controller issues, panics area/manifests labels Oct 22, 2024
@agilgur5

agilgur5 commented Oct 22, 2024

Is this argo "orchestrator" pod safe to evict ?

The Controller is designed to be resilient to restarts, as it stores all state in its managed CRs. However, intermediate state, such as Pod changes, may be missed during downtime. See also the "High Availability" documentation.

needs pods to be managed by something deployment, statefulset etc

The Controller is also backed by a Deployment currently.

This issue is asking about individual Workflow Pods though. Whether those are safe to evict depends entirely on how you designed your tasks. Argo cannot answer that for you, as it is in user-land.
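
If the tasks are known to be idempotent, one way to opt Workflow Pods into eviction is sketched below. It assumes the workflow-level `podMetadata` field is used to put the cluster autoscaler's `safe-to-evict` annotation on every pod the workflow creates. The workflow name, image, and retry settings are hypothetical, and whether an evicted pod is actually re-run depends on the retry policy you choose.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: evictable-example-   # hypothetical name
spec:
  entrypoint: main
  # Applied to every pod of this workflow; tells the cluster
  # autoscaler it may evict the pod when draining a node.
  podMetadata:
    annotations:
      cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
  templates:
    - name: main
      # Assumed retry settings so that an evicted step is attempted again.
      retryStrategy:
        limit: 3
        retryPolicy: Always
      container:
        image: alpine:3.16           # hypothetical image
        command: [sh, -c]
        args: ["echo idempotent step"]
```

Only add the annotation to steps that can be restarted from scratch without side effects; a step that is interrupted halfway through a non-repeatable action is not a good candidate.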

@agilgur5 agilgur5 added type/support User support issue - likely not a bug and removed type/bug area/manifests labels Oct 22, 2024