Etcd gets full with ~500 workflows #12802
So ~2500 concurrent Pods total?
For many Workflows and large Workflows, it may indeed stress the k8s API. There are a few features you may want to use that are well documented:
EDIT: Some less documented options include:
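For reference, a minimal sketch of the kind of controller-level throttling and offloading settings usually recommended here (the field names are real workflow-controller-configmap options; the values, and whether these are exactly the options referred to above, are assumptions):

```yaml
# Illustrative workflow-controller-configmap excerpt (values are placeholders)
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
data:
  # Cap how many Workflows may run at once, globally and per namespace
  parallelism: "200"
  namespaceParallelism: "50"
  # Limit how quickly the controller creates pods against the k8s API
  resourceRateLimit: |
    limit: 20
    burst: 2
  # Keep large node statuses out of etcd and archive completed Workflows to SQL
  persistence: |
    nodeStatusOffLoad: true
    archive: true
    # (a postgresql/mysql connection block is also required for archiving)
```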
Status:
Recent Efforts:
Yea Workflows in general have a lot of diverse use-cases, so capacity planning can be challenging. Configurations that are ideal for short Workflows are not necessarily ideal for long Workflows, etc.
Archiving is asynchronous. The entire Controller is async, it's all goroutines.
This sounds like it might be getting CPU starved? Without detailed metrics etc it's pretty hard to dive into details. It also sounds a bit like #11948, which was fixed in 3.4.14 and later. Not entirely the same though from the description (you have an etcd OOM vs a Controller OOM and your archive is growing vs your live Workflows).
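If CPU starvation is the suspicion, a first step is usually to give the controller more headroom on its Deployment; a minimal sketch (the resource values are placeholders, not a recommendation):

```yaml
# Illustrative excerpt from the workflow-controller Deployment
spec:
  template:
    spec:
      containers:
        - name: workflow-controller
          resources:
            requests:
              cpu: "2"        # reserve enough CPU so the controller isn't starved under load
              memory: 4Gi
            limits:
              memory: 8Gi     # leaving the CPU limit unset avoids CFS throttling
```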
You also checked this box, but are not on
You can use
If you're creating as many Workflows as you're deleting, that sounds possible. Again, you didn't provide metrics, but those would be ideal to track when doing any sort of performance tuning.
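On collecting those metrics: the controller exposes Prometheus metrics (in 3.4 the defaults are port 9090, path /metrics), so a scrape config roughly like the following is enough. The Service name/labels and the use of the Prometheus Operator are assumptions about the install:

```yaml
# Hypothetical ServiceMonitor; assumes a Service labeled app: workflow-controller
# that exposes the controller's metrics port (9090 by default) as a named port "metrics"
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: argo-workflow-controller
spec:
  selector:
    matchLabels:
      app: workflow-controller
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s
```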
Are you on a managed k8s control plane provider? E.g. EKS, AKS, GKE, etc? Those try to auto-scale and have pre-set limits, so that can certainly happen. If you're using a self-managed control plane (e.g. kOps), you can vertically scale etcd and the rest of the k8s control plane (as well as horizontally scale to an extent, as etcd is fully consistent and so will eventually hit an upper bound).
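For a self-managed control plane, "scaling etcd" in practice usually means faster disks and more memory, plus raising the backend quota and enabling auto-compaction. A sketch of the relevant etcd flags (shown as a static-pod excerpt; 8 GiB matches the limit being hit here and is also etcd's recommended practical maximum):

```yaml
# Illustrative etcd static pod excerpt; only the flags relevant to DB growth are shown
spec:
  containers:
    - name: etcd
      command:
        - etcd
        - --quota-backend-bytes=8589934592   # raise the backend quota to 8 GiB
        - --auto-compaction-mode=periodic    # compact old revisions automatically
        - --auto-compaction-retention=1h     # keep roughly 1 hour of revision history
```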
I listed this in my previous comment --
Sorry, I was limited by an NDA, but I can now share more details.
Current standalone MySQL instance quota: 60-80 GB of memory and a local NVMe disk. Once the archived workflows from the past 30-45 days reached ~250 GB, queries and writes on the table
Self-managed cluster:
We've now enabled regular etcd compaction and compression, triggered by the DB size metrics. That is a hack.
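For reference, that kind of hack typically looks like a scheduled job running etcdctl compaction and defragmentation; a rough sketch (the image, endpoints, and TLS flags are placeholders and would need to match the cluster):

```yaml
# Hypothetical CronJob that compacts old revisions and defragments etcd hourly.
# Real clusters also need --endpoints/--cacert/--cert/--key flags on etcdctl.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-maintenance
spec:
  schedule: "0 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: etcdctl
              image: quay.io/coreos/etcd:v3.5.13   # placeholder image that ships etcdctl
              command:
                - /bin/sh
                - -c
                - |
                  export ETCDCTL_API=3
                  # Read the current revision, compact everything older, then defragment.
                  rev=$(etcdctl endpoint status --write-out=json | grep -o '"revision":[0-9]*' | grep -o '[0-9]*$')
                  etcdctl compact "$rev"
                  etcdctl defrag
                  # Clear a NOSPACE alarm if the DB previously hit its quota.
                  etcdctl alarm disarm
```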
It is enabled. Related config I could expose:

Persistence:

```yaml
connectionPool:
  maxIdleConns: 100
  maxOpenConns: 0
  connMaxLifetime: 0s
nodeStatusOffLoad: true
archive: true
archiveTTL: 7d
```

Workflow defaults:

```yaml
spec:
  ttlStrategy:
    secondsAfterCompletion: 0
    secondsAfterSuccess: 0
    secondsAfterFailure: 0
  podGC:
    strategy: OnPodCompletion
  parallelism: 3
```

Workflow controller args:

```yaml
args:
  - '--configmap'
  - workflow-controller-configmap
  - '--executor-image'
  - 'xxxxx/argoexec:v3.4.10'
  - '--namespaced'
  - '--workflow-ttl-workers=8'   # 4 -> 8
  - '--pod-cleanup-workers=32'   # 4 -> 32
  - '--workflow-workers=64'      # 32 -> 64
  - '--qps=50'
  - '--kube-api-burst=90'        # 60 -> 90
  - '--kube-api-qps=60'          # 40 -> 60
```

Executor config:

```yaml
imagePullPolicy: IfNotPresent
resources:
  requests:
    cpu: 10m
    memory: 64Mi
  limits:
    cpu: 1000m
    memory: 512Mi
```

There are some desensitized etcd and Argo metrics screenshots: the first shows the etcd DB size varying rapidly, and the second shows the count of Workflows and Pods in the Argo namespace over the same period.
This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.
@leryn1122 can u include a graph of apiserver_storage_objects{resource="events"}? I'm facing the same issue and raised #13042 + #13089. I wonder if setting ARGO_PROGRESS_PATCH_TICK_DURATION to 0 will help produce fewer PATCH events too. Also, are you setting what's shown in argo-workflows/docs/workflow-controller-configmap.yaml, lines 36 to 42 (at commit 026b14e)?
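For what it's worth, a sketch of how that variable could be set: the `executor` block in the workflow-controller-configmap accepts container-spec fields, so (assuming that is where it needs to land in this setup) it would look roughly like:

```yaml
# Hypothetical workflow-controller-configmap excerpt setting the executor env var
# discussed above; per the comment, 0 is meant to suppress the periodic progress PATCHes
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
data:
  executor: |
    env:
      - name: ARGO_PROGRESS_PATCH_TICK_DURATION
        value: "0"
```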
Possible solutions:
Oh, I forgot to mention earlier, there is also the environment variable
@leryn1122 can u see what exactly is being changed on workflows.argoproj.io/workflowtaskresult.argoproj.io? also, is it every 10 seconds?
Pre-requisites
:latest
What happened/what did you expect to happen?
We run ~500 Workflows and ~500 Pods concurrently as offline tasks in our prod environment. Etcd rapidly filled up to its 8G size limit.
As a result, etcd and the apiserver became unavailable, and the Argo workflow controller restarted frequently.
Based on monitoring and metrics, our team concluded that etcd and the apiserver may become unavailable when running and pending Workflows flood into etcd.
For now, the team’s solutions are:
It is expected that Argo Workflows does not flood etcd or impact the stability of the whole cluster.
Version
v3.4.10
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Limited by NDA.
Logs from the workflow controller
Logs from in your workflow's wait container