death by workflow never reconciled
#13460
Unanswered
static-moonlight
asked this question in
Q&A
Replies: 2 comments
-
this is huge problem, that happens to me even when the controller is only at 30% memory utilization |
Beta Was this translation helpful? Give feedback.
0 replies
-
|
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
The general story, as I understand it:
An unknown number of workflows got into a state, which was reported as "workflow never reconciled". This lead to the health check to fail over and over again ... which basically killed the workflow controller. Meanwhile more and more workflows are being put into the backlog, but nothing gets executed anymore. The ever increasing number of pending/pre-pending workflows also wasn't helpful (I guess). The system died ... probably in the moment when the first workflows entered that problematic state.
This massive gap indicates, that no workflows have been executed in that time. In the ui I found countless workflows, pending, waiting to be scheduled by the workflow controller, which at this point only panicked about an unknown number of "workflow never reconciled" warnings and totally forgot to care about the rest. Also, the problem seemed to start abruptly.
The number of restarts indicates, that the workflow controller was automatically restarted every ~6 minutes.
$ kubectl describe pod -n argo workflow-controller-76ffbdcf8f-57p67 [...] Image: quay.io/argoproj/workflow-controller:v3.5.8 [...] State: Waiting Reason: CrashLoopBackOff Last State: Terminated Reason: Error Exit Code: 2 Started: Mon, 12 Aug 2024 16:32:23 +0000 Finished: Mon, 12 Aug 2024 16:36:23 +0000 Ready: False Restart Count: 1190 Liveness: http-get http://:6060/healthz delay=90s timeout=30s period=60s #success=1 #failure=3 [...] Events: Type Reason Age From Message ---- ------ ---- ---- ------- [...] Warning Unhealthy 5m46s (x3575 over 6d10h) kubelet Liveness probe failed: HTTP probe failed with statuscode: 500 [...]
From the logs I can assume, that at least 3 workflows had that never-reconciled state. The actual number is unknown.
I have no idea what the root cause is. It took me over 4 hours to manually clean this up (I had to manually delete >30k workflows from my system). After that was finished, the system recovered. I find it concerning though, that an invalid workflow state can bring the entire system down.
(1) Does anyone know how this state ("workflow never reconciled") comes to be?
(2) Does anyone know what I could do to prevent/stop a cascade like this to keep the workflow system alive or to allow auto recovery?
(3) How can it be, that this "workflow never reconciled" problem completely breaks the workflow controller and renders it useless? I mean, It's just a bunch of invalid workflow states, right? Why can't the workflow controller just ignore them (in terms of usual business) and keep going with the rest that isn't broken? Of course, the workflow controller should report them. But is that a good enough reason to stop the world?
Thankfully, this only happened in a testing environment. BUT. I got to say, my confidence into Argo Workflows really took a hit here. If this happens in production, I'll have some explaining to do, so I need a little help here.
Beta Was this translation helpful? Give feedback.
All reactions