Workflow / event pods block scale down #9481
This looks like it is more related to a GKE issue.
+1 to @sarabala1979. Does not seem like an Argo issue. Please re-open if you have more info.
Hello, no one answered his question though. GKE needs pods to be managed by something (a Deployment, StatefulSet, etc.), or else to carry the safe-to-evict annotation, in order to scale down the number of nodes in the cluster. Is this Argo "orchestrator" pod safe to evict? (I don't know the implementation details, but if it retries until success without side effects, I'd consider it "safe" enough to be evicted on scale-downs.)
The Controller is designed to be resilient to restarts as it stores all state in its managed CRs. Intermediate state, such as Pod changes, may be missed during downtime however. See also the "High Availability" documentation.
The Controller is also backed by a Deployment currently. This issue is asking about individual Workflow Pods though, for which it entirely depends on how you designed your tasks -- Argo cannot answer that for you as it is in user-land.
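For illustration only: if you do conclude that your task pods are safe to evict, one possible approach (a sketch, assuming an Argo Workflows version that supports `spec.podMetadata`) is to stamp the GKE cluster-autoscaler annotation onto every pod a Workflow creates:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: evictable-example-
spec:
  entrypoint: main
  # podMetadata is applied to every pod created by this Workflow.
  # The annotation tells the GKE cluster autoscaler that evicting
  # these pods during node scale-down is acceptable.
  podMetadata:
    annotations:
      cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
  templates:
    - name: main
      container:
        image: alpine:3.18
        command: [sh, -c, "echo hello"]
```

Only consider this for tasks that are idempotent or covered by a `retryStrategy` (or resubmission); an evicted pod will otherwise fail its step.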
Checklist
Versions
Summary
After a large event kicked off hundreds of workflows several weeks ago, our prod cluster has not been able to scale back down. Both workflow and event pods block the cluster scale down in GKE due to "Pod is blocking scale down because it has local storage" and "Pod is blocking scale down because it's not backed by a controller".
What is the best practice here? Both of these warnings suggest adding a safe-to-evict annotation - is this safe to add?
Worth noting that both CPU and memory utilisation are low.
Additionally, we've implemented pod disruption budgets to reduce the chance of voluntary disruption of workflow pods. In the meantime, we are investigating internally whether this could be one factor blocking the scale down after a surge.
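For context, a minimal sketch of such a PodDisruptionBudget, assuming workflow pods carry an illustrative `app: workflow-task` label (e.g. applied via `spec.podMetadata.labels`):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: workflow-task-pdb
spec:
  # Allow at most one matching pod to be voluntarily disrupted at a time.
  maxUnavailable: 1
  selector:
    matchLabels:
      app: workflow-task  # illustrative label set on workflow pods via podMetadata
```

Note that a PDB only limits voluntary disruptions, and it is one more thing that can prevent the autoscaler from draining a node, so it may itself work against scale-down.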
Diagnostics
This can be reproduced by kicking off 100+ workflows that sleep for 1000+ seconds.
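A rough reproduction sketch (the manifest, image, and submission loop below are illustrative):

```yaml
# sleep-repro.yaml -- submit ~100 copies, e.g.:
#   for i in $(seq 1 100); do argo submit sleep-repro.yaml; done
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: sleep-repro-
spec:
  entrypoint: sleep
  templates:
    - name: sleep
      container:
        image: alpine:3.18
        # Long-running step that keeps the pod alive while the
        # cluster autoscaler tries to scale nodes back down.
        command: [sh, -c, "sleep 1200"]
```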
We see ~13k logs/hour for the local storage scale-down warning, and ~100 logs/hour for the "not backed by a controller" scale-down warning.
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.