Etcd get full with ~500 workflows. #12802

Open
2 of 4 tasks
leryn1122 opened this issue Mar 14, 2024 · 9 comments
Labels
area/gc Garbage collection, such as TTLs, retentionPolicy, delays, and more type/support User support issue - likely not a bug

Comments

leryn1122 commented Mar 14, 2024

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

We run ~500 workflows and ~500 pods concurrently as offline tasks in our production environment. Etcd filled up rapidly, reaching 8G.
As a result, etcd and the apiserver became unavailable, and the Argo workflow controller restarted frequently.
Based on monitoring and metrics, our team concluded that etcd and the apiserver can become unavailable when running and pending workflows flood etcd.

For now, the team’s solutions are:

  • Limiting workflow quotas
  • Reducing the size of the workflow templates rendered by the business side
  • Running a scheduled script that checks etcd and compresses it when it is full
  • Migrating the business Argo installation into its own dedicated cluster

It is expected that argo workflows do not flood into etcd or impact the stability of whole cluster.

Version

v3.4.10

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

Limited by NDA.

Logs from the workflow controller

time="2024-03-06T01:59:48.872Z" level=info msg="Mark node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62[0].xxxxxxx[1].subtask(20:raw/587/2023/12/6/1734100703048318977/ros2/20231206104000_20231206104046_5m/raw_587_20231206104000_20231206104046.db3)[2].xxxxxxx[0].xxxxxxx[0].xxxxxxx(0) as Pending, due to: admission webhook \"resourcesquotas.quota.kubesphere.io\" denied the request: pods \"slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-3557973944\" is forbidden: exceeded quota: argo, requested: count/pods=1, used: count/pods=822, limited: count/pods=600" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.875Z" level=info msg="SG Outbound nodes of slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-2759181252 are [slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-3992814895]" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.872Z" level=info msg="Transient error: admission webhook \"resourcesquotas.quota.kubesphere.io\" denied the request: pods \"slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-3557973944\" is forbidden: exceeded quota: argo, requested: count/pods=1, used: count/pods=822, limited: count/pods=600"
time="2024-03-06T01:59:48.187Z" level=info msg="Workflow pod is missing" namespace=argo nodeName="slice-587-20231201141424-20240128125944-20240201143506-fen7hx62[0].xxxxxxx[1].subtask(21:raw/587/2024/1/26/1751125289026650113/ros2/20240126140500_20240126141000_5m/raw_587_20240126140500_20240126141000.db3)[2].xxxxxxx[0].xxxxxxx[1].vision2d-lidar-fusion-match(0)" nodePhase=Pending recentlyStarted=false workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.185Z" level=info msg="Processing workflow" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.195Z" level=info msg="SG Outbound nodes of slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-814885821 are [slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-2761991884]" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.875Z" level=info msg="SG Outbound nodes of slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-765220508 are [slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-3024955863]" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.206Z" level=info msg="SG Outbound nodes of slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-239402416 are [slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-193874387]" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.873Z" level=info msg="Workflow step group node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-3590128111 not yet completed" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.187Z" level=info msg="Workflow pod is missing" namespace=argo nodeName="slice-587-20231201141424-20240128125944-20240201143506-fen7hx62[0].xxxxxxx[1].subtask(20:raw/587/2023/12/6/1734100703048318977/ros2/20231206104000_20231206104046_5m/raw_587_20231206104000_20231206104046.db3)[2].xxxxxxx[0].xxxxxxx[0].xxxxxxx(0)" nodePhase=Pending recentlyStarted=false workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.186Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.206Z" level=info msg="SG Outbound nodes of slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-1845491224 are [slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-1265119851]" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.872Z" level=info msg="node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-3557973944 message: admission webhook \"resourcesquotas.quota.kubesphere.io\" denied the request: pods \"slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-3557973944\" is forbidden: exceeded quota: argo, requested: count/pods=1, used: count/pods=822, limited: count/pods=600" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.872Z" level=info msg="Workflow step group node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-285032592 not yet completed" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.873Z" level=info msg="Workflow step group node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-3624891064 not yet completed" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.187Z" level=info msg="node unchanged" nodeID=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-2773247126
time="2024-03-06T01:59:49.647Z" level=info msg="template (node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-2485359944) active children parallelism exceeded 3" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:49.624Z" level=info msg="Workflow step group node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-1031877465 not yet completed" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:49.623Z" level=info msg="Transient error: admission webhook \"resourcesquotas.quota.kubesphere.io\" denied the request: pods \"slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-1351688902\" is forbidden: exceeded quota: argo, requested: count/pods=1, used: count/pods=821, limited: count/pods=600"
time="2024-03-06T01:59:49.648Z" level=info msg="template (node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-2485359944) active children parallelism exceeded 3" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:49.632Z" level=info msg="template (node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-2485359944) active children parallelism exceeded 3" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:49.633Z" level=info msg="template (node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-2485359944) active children parallelism exceeded 3" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62

Logs from in your workflow's wait container

N/A
@agilgur5 agilgur5 added type/support User support issue - likely not a bug and removed type/bug labels Mar 15, 2024

agilgur5 commented Mar 15, 2024

We run ~500 workflows and ~500 pods concurrently

So ~2500 concurrent Pods total?

It is expected that argo workflows do not flood into etcd or impact the stability of whole cluster.

For many Workflows and large Workflows, it may indeed stress the k8s API and etcd. That's not really an Argo limitation, that's how k8s works with shared control plane resources.

There are a few features you may want to use that are well documented:

EDIT: Some less documented options include:

  • Per below comment, disabling nodeEvents in the ConfigMap to save space in etcd (at the cost of less available tracking):
    # Whether or not to emit events on node completion. These can take up a lot of space in
    # k8s (typically etcd) resulting in errors when trying to create new events:
    # "Unable to create audit event: etcdserver: mvcc: database space exceeded"
    # This config item allows you to disable this.
    # (since v2.9)
    nodeEvents: |
      enabled: true
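
As a rough illustration of the option above, a ConfigMap that turns node events off might look like the sketch below. The ConfigMap name and namespace are assumptions based on the controller args and logs shared in this thread (`workflow-controller-configmap`, `argo`); adjust them to your installation.

```yaml
# Sketch only: stop the controller from emitting a Kubernetes event per completed node.
# Name and namespace are assumptions; match them to your own install.
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  nodeEvents: |
    enabled: false
```

The trade-off is the one already noted: with node events disabled, there is less tracking information available when inspecting events in the namespace.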

@agilgur5 agilgur5 added area/gc Garbage collection, such as TTLs, retentionPolicy, delays, and more problem/more information needed Not enough information has been provide to diagnose this issue. labels Mar 15, 2024
@leryn1122
Author

Status:

  • Currently the largest Argo installation has ~4000 pending workflows, ~1000 running workflows, and ~1000 pods. There are also 4 smaller Argo installations, which I am ignoring here.
  • The former cluster had >10 nodes for Argo and ~180 nodes in total. We have now built a standalone cluster for Argo.
  • Workflows behave different from business. Some of them finished in several seconds, while some run for a few hours.

Recent Efforts:

  • Archiving to a standalone MySQL instance has been enabled for months.
  • Another problem we solved: if the DB gets stuck on growing data while archiving, the workflow controller stops handling requests. Arching should be asynchronized I think. Archiving goes slowly, or even stuck after argo_archived_workflows reached ~250G. We wrote a cronjob to daily delete argo_archived_workflows with workflows finished weeks ago to prevent archiving stuck.
  • --workflow-ttl-workers and --pod-cleanup-workers: We tried adjusting these. It works but does not save etcd from stress.
  • Tuning parallelism and pod limits has been our primary approach in the past weeks. A lower limit does not satisfy the business requirements. A rapid/jump change leads to etcd be unstable in my experiences.
  • We've urged developpers to reduces the size of workflow template from 200K to smaller ones.
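
For readers following along, the parallelism limits mentioned in the list above can also be set controller-wide rather than per Workflow. A minimal sketch, assuming the `parallelism` and `namespaceParallelism` keys of the workflow-controller ConfigMap; the ConfigMap name, namespace, and numbers are placeholders, not recommendations:

```yaml
# Sketch only: controller-wide caps on concurrently running Workflows.
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap   # assumed name
  namespace: argo                       # assumed namespace
data:
  # Maximum number of Workflows the controller runs at once (placeholder value).
  parallelism: "500"
  # Maximum number of Workflows running at once in any single namespace (placeholder value).
  namespaceParallelism: "200"
```

Workflows above these limits stay Pending, so they still exist as objects in etcd. Limits of this kind reduce pod churn, not the number of stored Workflow objects.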

agilgur5 commented Mar 19, 2024

  • Workflows behave different from business. Some of them finished in several seconds, while some run for a few hours.

Yea Workflows in general have a lot of diverse use-cases, so capacity planning can be challenging. Configurations that are ideal for short Workflows are not necessarily ideal for long Workflows, etc.

Arching should be asynchronized I think [sic]

Archiving is asynchronous. The entire Controller is async, it's all goroutines.

Archiving goes slowly, or even stuck after argo_archived_workflows reached ~250G.

This sounds like it might be getting CPU starved? Without detailed metrics etc it's pretty hard to dive into details.

It also sounds a bit like #11948, which was fixed in 3.4.14 and later. Not entirely the same though from the description (you have an etcd OOM vs a Controller OOM and your archive is growing vs your live Workflows).

  • I can confirm the issue exists when I tested with :latest

v3.4.10

You also checked this box, but are not on latest. Please fill out the issue template accurately, those questions are asked for very good reasons.

We wrote a cronjob to daily delete argo_archived_workflows with workflows finished weeks ago to prevent archiving stuck.

You can use archiveTTL for this as a built-in option.

It works but does not save etcd from stress.

If you're creating as many Workflows as you're deleting, that sounds possible. Again, you didn't provide metrics, but those would be ideal to track when doing any sort of performance tuning.

A rapid/jump change leads to etcd be unstable in my experiences.

Are you on a managed k8s control plane provider? E.g. EKS, AKS, GKE, etc? Those try to auto-scale and have pre-set limits, so that can certainly happen. If you're using a self-managed control plane (e.g. kOps), you can vertically scale etcd and the rest of the k8s control plane (as well as horizontally scale to an extent, as etcd is fully consistent and so will eventually hit an upper bound).

We've urged developpers to reduces the size of workflow template from 200K to smaller ones. [sic]

I listed this in my previous comment -- nodeStatusOffload can help with this.

@leryn1122
Author

Sorry, I was limited by the NDA. I can now share more details.
Configuration and thresholds have varied over the past months.

You can use archiveTTL for this as a built-in option.

Current archiveTTL is 7d.

Standalone MySQL instance quota: 60-80G memory and local NVMe disk.

Once 30-45 days of archived workflows had accumulated to ~250G, queries and writes on the argo_archived_workflows table became slow. A single SQL statement deleting workflows took 2-3 minutes. We tried to speed it up with table indexes and MySQL hints, but that had no noticeable effect. So I rebuilt MySQL and added the hacky cronjob mentioned before, and it now runs stably.

Are you on a managed k8s control plane provider? E.g. EKS, AKS, GKE, etc? Those try to auto-scale and have pre-set limits, so that can certainly happen. If you're using a self-managed control plane (e.g. kOps), you can vertically scale etcd and the rest of the k8s control plane (as well as horizontally scale to an extent, as etcd is fully consistent and so will eventually hit an upper bound).

Self-managed cluster:

  • Kubernetes v1.21.9
  • Kubesphere v3.2.1

We now run etcd compact and compression regularly, triggered by the DB size metrics. That is a hack.
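
As a point of comparison with the metric-triggered hack, etcd itself can compact old revisions periodically and enforce a backend size quota. A minimal sketch of an etcd configuration file fragment, assuming etcd is started with `--config-file`; all values are placeholders. Compaction alone does not shrink the DB file on disk, so defragmentation (`etcdctl defrag`) remains a separate step.

```yaml
# Sketch only: fragment of an etcd config file (keys mirror the etcd command-line flags).
# Backend size quota in bytes; 8589934592 = 8 GiB (placeholder value).
quota-backend-bytes: 8589934592
# Compact the key-value history automatically on a time basis...
auto-compaction-mode: periodic
# ...keeping roughly the last hour of revisions (placeholder value).
auto-compaction-retention: "1h"
```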

I listed this in my previous comment -- nodeStatusOffload can help with this.

It is enabled. The related config I can share:

Persistence:

connectionPool:
  maxIdleConns: 100
  maxOpenConns: 0
  connMaxLifetime: 0s
nodeStatusOffLoad: true
archive: true
archiveTTL: 7d

Workflow defaults:

spec:
  ttlStrategy:
    secondsAfterCompletion: 0
    secondsAfterSuccess: 0
    secondsAfterFailure: 0
  podGC:
    strategy: OnPodCompletion
  parallelism: 3

Workflow controller args

args:
  - '--configmap'
  - workflow-controller-configmap
  - '--executor-image'
  - 'xxxxx/argoexec:v3.4.10'
  - '--namespaced'
  - '--workflow-ttl-workers=8'      # 4->8
  - '--pod-cleanup-workers=32'  # 4->32
  - '--workflow-workers=64'        # 32->64
  - '--qps=50'
  - '--kube-api-burst=90'  # 60->90
  - '--kube-api-qps=60'    # 40->60

Executor config

imagePullPolicy: IfNotPresent
resources:
  requests:
    cpu: 10m
    memory: 64Mi
  limits:
    cpu: 1000m
    memory: 512Mi

Here are some anonymized etcd and Argo metrics screenshots: the first shows the etcd DB size varying rapidly, and the second shows the count of workflows and pods in the argo namespace over the same period.

Screenshot from 2024-03-22 09-37-36
Screenshot from 2024-03-22 09-37-39

Contributor

github-actions bot commented Apr 8, 2024

This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.

@github-actions github-actions bot added problem/stale This has not had a response in some time and removed problem/stale This has not had a response in some time problem/more information needed Not enough information has been provide to diagnose this issue. labels Apr 8, 2024
Contributor

tooptoop4 commented May 13, 2024

@leryn1122 can u include graph of apiserver_storage_objects{resource="events"} ?

i'm facing same issue and raised #13042 + #13089

i wonder if ARGO_PROGRESS_PATCH_TICK_DURATION as 0 will help to make less PATCH events too

are u setting

# Whether or not to emit events on node completion. These can take up a lot of space in
# k8s (typically etcd) resulting in errors when trying to create new events:
# "Unable to create audit event: etcdserver: mvcc: database space exceeded"
# This config item allows you to disable this.
# (since v2.9)
nodeEvents: |
  enabled: true
to false too?

Author

leryn1122 commented Jun 17, 2024

@leryn1122 can u include graph of apiserver_storage_objects{resource="events"} ?

i'm facing same issue and raised #13042 + #13089

i wonder if ARGO_PROGRESS_PATCH_TICK_DURATION as 0 will help to make less PATCH events too

are u setting

# Whether or not to emit events on node completion. These can take up a lot of space in
# k8s (typically etcd) resulting in errors when trying to create new events:
# "Unable to create audit event: etcdserver: mvcc: database space exceeded"
# This config item allows you to disable this.
# (since v2.9)
nodeEvents: |
  enabled: true

to false too?

  1. apiserver_storage_objects{resource="events"} ranges from 90k ~ 15k, with a maximum of 30k+, and the current cluster is used only to run Argo workflows.
  2. nodeEvents is enabled
  3. I wrote an etcd-jdbc. It shows that:
    • workflows.argoproj.io objects are patched every time a workflow's status changes, so their etcd version increases rapidly; e.g. a single workflow has 370+ versions.
    • The count of workflowtaskresult.argoproj.io objects also increases rapidly; the test Argo cluster I was tuning has 35k+ entries.

Possible solutions:
These work for my team for now. They are not guaranteed to be a good solution for you.

  • Update etcd to the latest version and increase the DB size.
  • Separate Argo from your main business workloads.
  • Use an external etcd instead of the built-in one.

@agilgur5

Oh I forgot to mention earlier, there is also the environment variable ALWAYS_OFFLOAD_NODE_STATUS that could help in this scenario as well
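
A minimal sketch of how that variable might be set on the controller, written as a patch to the Deployment rather than a complete manifest. The Deployment, container, and namespace names are assumptions; it also presumes node status offloading is already configured under persistence, as in the config shared earlier in this thread.

```yaml
# Sketch only: patch fragment for the workflow-controller Deployment.
# Deployment, container, and namespace names are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workflow-controller
  namespace: argo
spec:
  template:
    spec:
      containers:
        - name: workflow-controller
          env:
            # Ask the controller to always offload node status to the database.
            - name: ALWAYS_OFFLOAD_NODE_STATUS
              value: "true"
```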

@tooptoop4
Contributor

@leryn1122 can u see what exactly is being changed on workflows.argoproj.io/workflowtaskresult.argoproj.io ? also is it every 10 seconds?
