death by `workflow never reconciled` #13460

static-moonlight · 2024-08-13T08:58:06Z

static-moonlight
Aug 13, 2024

The general story, as I understand it:

An unknown number of workflows got into a state, which was reported as "workflow never reconciled". This lead to the health check to fail over and over again ... which basically killed the workflow controller. Meanwhile more and more workflows are being put into the backlog, but nothing gets executed anymore. The ever increasing number of pending/pre-pending workflows also wasn't helpful (I guess). The system died ... probably in the moment when the first workflows entered that problematic state.

This massive gap indicates, that no workflows have been executed in that time. In the ui I found countless workflows, pending, waiting to be scheduled by the workflow controller, which at this point only panicked about an unknown number of "workflow never reconciled" warnings and totally forgot to care about the rest. Also, the problem seemed to start abruptly.

$ kubectl get pods -n argo
NAME                                   READY   STATUS    RESTARTS           AGE
argo-server-7569f7f459-brkr9           1/1     Running   0                  24d
database-0                             1/1     Running   0                  24d
workflow-controller-76ffbdcf8f-57p67   1/1     Running   1190 (3m29s ago)   24d

The number of restarts indicates, that the workflow controller was automatically restarted every ~6 minutes.

$ kubectl describe pod -n argo workflow-controller-76ffbdcf8f-57p67
[...]
    Image:         quay.io/argoproj/workflow-controller:v3.5.8
[...]
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Mon, 12 Aug 2024 16:32:23 +0000
      Finished:     Mon, 12 Aug 2024 16:36:23 +0000
    Ready:          False
    Restart Count:  1190
    Liveness:       http-get http://:6060/healthz delay=90s timeout=30s period=60s #success=1 #failure=3
[...]
Events:
  Type     Reason     Age                       From     Message
  ----     ------     ----                      ----     -------
[...]
  Warning  Unhealthy  5m46s (x3575 over 6d10h)  kubelet  Liveness probe failed: HTTP probe failed with statuscode: 500
[...]

$ kubectl logs -n argo workflow-controller-76ffbdcf8f-57p67 | grep healthz
time="2024-08-12T16:34:23.226Z" level=info msg=healthz age=5m0s err="workflow never reconciled: [...]-n52m2" instanceID= labelSelector="!workflows.argoproj.io/phase,!workflows.argoproj.io/controller-instanceid" managedNamespace=
time="2024-08-12T16:35:23.226Z" level=info msg=healthz age=5m0s err="workflow never reconciled: [...]-p8mbg" instanceID= labelSelector="!workflows.argoproj.io/phase,!workflows.argoproj.io/controller-instanceid" managedNamespace=
time="2024-08-12T16:36:23.222Z" level=info msg=healthz age=5m0s err="workflow never reconciled: [...]-sjj9j" instanceID= labelSelector="!workflows.argoproj.io/phase,!workflows.argoproj.io/controller-instanceid" managedNamespace=

From the logs I can assume, that at least 3 workflows had that never-reconciled state. The actual number is unknown.

I have no idea what the root cause is. It took me over 4 hours to manually clean this up (I had to manually delete >30k workflows from my system). After that was finished, the system recovered. I find it concerning though, that an invalid workflow state can bring the entire system down.

(1) Does anyone know how this state ("workflow never reconciled") comes to be?

(2) Does anyone know what I could do to prevent/stop a cascade like this to keep the workflow system alive or to allow auto recovery?

(3) How can it be, that this "workflow never reconciled" problem completely breaks the workflow controller and renders it useless? I mean, It's just a bunch of invalid workflow states, right? Why can't the workflow controller just ignore them (in terms of usual business) and keep going with the rest that isn't broken? Of course, the workflow controller should report them. But is that a good enough reason to stop the world?

Thankfully, this only happened in a testing environment. BUT. I got to say, my confidence into Argo Workflows really took a hit here. If this happens in production, I'll have some explaining to do, so I need a little help here.

ospiegel91 · 2024-09-01T09:52:12Z

ospiegel91
Sep 1, 2024

this is huge problem, that happens to me even when the controller is only at 30% memory utilization

0 replies

agilgur5 · 2024-09-01T16:29:00Z

agilgur5
Sep 1, 2024

As far as I've seen, this only happens in one of two scenarios:
1. lack of CPU and/or k8s API pressure/throttling/rate limiting. This is the case of Liveness probe fails with 500 and workflow never reconciled -- under-resourced k8s apiserver #11051 (comment)
2. An invalid Workflow that the controller is having trouble resolving. This is the case of Don't Crash Controller when an Invalid Workflow Object is present #10872
Depends on the scenario:
1. As far as I understand, the healthcheck failing in this case seems intentional, as it will relieve some of this pressure and may finish processing some during init (before it gets throttled again). So that is supposed to facilitate auto recovery as I understand
  1. Increasing CPU and k8s control plane resources will help mitigate this. You can also configure the healthcheck to last longer, including with the HEALTHZ_* environment variables.
2. These are all bugs. The Workflow Informer is "tolerant" in that it knows to ignore invalid workflows, but it seems like sometimes an edge case is not handled. The error message should name the problematic workflow and filing a reproducible bug report with it will help to debug a fix.
  1. Doing validation on all Workflows beforehand, via the API or argo lint etc, can help mitigate that. But it's certainly possible for a validation miss to happen there. For k8s submissions, see Validating Admission Webhook #13503
  2. You can also disable the liveness probe and alert on these problematic workflows instead.
In the first scenario, the liveness failure seems to be intended. In the second scenario, indeed the Controller normally ignores them, but it seems like sometimes an edge case is unhandled

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

death by `workflow never reconciled` #13460

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

death by workflow never reconciled #13460

static-moonlight Aug 13, 2024

Replies: 2 comments

ospiegel91 Sep 1, 2024

agilgur5 Sep 1, 2024

death by `workflow never reconciled` #13460

static-moonlight
Aug 13, 2024

ospiegel91
Sep 1, 2024

agilgur5
Sep 1, 2024