delay between step/pod completion and workflow completion #13671
Comments
Are there many workflows in the cluster?
@shuangkun note the workflow I ran is not a CronWorkflow (just a normal workflow). I'm not overriding the defaults for --workflow-workers or any other worker-related args. I noticed the queue depth of cron_wf_queue was 35 at the same time the workflow started, but it went to 0 a minute later; all other queue depths were constantly 0. Time objects spent waiting in the queue was 13s for cron_wf_queue. argo_workflows_operation_duration_seconds_bucket showed 3s at p95. argo_workflows_queue_adds_count was 257 for cron_wf_queue, 95 for workflow_queue, and 16 for pod_cleanup_queue.
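For reference, those queue and operation metrics can be pulled straight off the controller's metrics endpoint; a quick way to eyeball them (the argo namespace, deployment name, and port 9090 are assumed defaults):

```bash
# Port-forward the workflow-controller metrics port and grep the metrics
# mentioned above (queue depth/adds, busy workers, operation duration).
kubectl -n argo port-forward deploy/workflow-controller 9090:9090 & PF_PID=$!
sleep 2
curl -s http://localhost:9090/metrics \
  | grep -E 'argo_workflows_(queue_depth|queue_adds_count|workers_busy|operation_duration)'
kill "$PF_PID"
```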
Increasing --qps, --burst, --pod-cleanup-workers, --cron-workflow-workers, and --workflow-workers did NOT help. One strange thing: pod storage fluctuates up to around 29MB and then down to 13MB, and around the same time a large DAG with 42 pods ran. Also around the same time the argo_workflows_queue_adds_count metric added 500 in 5 minutes for workflow_queue and went to 75 for pod_cleanup_queue, but the depth never went up, the busy workers never went above 5, and there were never more than 45 workflows across all states in that time! I'm thinking the workflow informer is incomplete. Is there some setting to say "just look at all workflows every X seconds even if they're not in the informer"? @shuangkun @fyp711 do you think my issue could be related to the informer missing things, like you raised in #13466?
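A sketch of how those flags can be raised on the controller Deployment (namespace/deployment names and the values shown are assumptions, not a recommendation):

```bash
# Inspect the current args, then edit them in place to bump the worker counts
# and API client limits.
kubectl -n argo get deploy workflow-controller \
  -o jsonpath='{.spec.template.spec.containers[0].args}{"\n"}'
kubectl -n argo edit deploy workflow-controller
# e.g. under the container's args:
#   - --workflow-workers=64
#   - --pod-cleanup-workers=32
#   - --cron-workflow-workers=16
#   - --qps=50
#   - --burst=100
```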
@tooptoop4 As you found in #13690, it is 20 minutes.
@shuangkun it can be reproduced stably: keep a constant flow of short (i.e. finishing in 1 min) single-step workflows running. For example, at 12:03pm submit 200 workflows; at 12:04pm those prior 200 have finished, so submit another 200; at 12:05pm submit another 200, and so on, giving a throughput of 12000 per hour (200 per minute). Then compare the finish time of the step with the finish time of the workflow. The difference I have seen is 10s in the best case but around 11 minutes in the worst case. I wonder if debug logs on what the informer has in its queue would help to investigate. This sounds like #1416 (comment).
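A rough way to drive that load (short-one-step.yaml is a placeholder for any single-step workflow that finishes in about a minute; the argo namespace is an assumption):

```bash
# Submit 200 short workflows every minute (~12000/hour), matching the
# reproduction described above.
while true; do
  for i in $(seq 1 200); do
    argo submit -n argo short-one-step.yaml >/dev/null &
  done
  wait
  sleep 60
done
```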
@tooptoop4 Have you tried the #13690 adjustment to alleviate this problem?
@shuangkun the bad news is that lowering the workflow and pod resync periods to around 1-2 minutes did not help; I still had a case where the delay between step end time and workflow end time was around 269 seconds. BUT the good news is I think I've stumbled upon the root cause, read below: the main container logs show it actually finished doing work (i.e. my pod running Python code) at 21:32:50 (this also aligns with the END TIME of the pod step in the Argo UI). Workflow controller logs (containing 'cleaning up pod') for this pod show it tried several times; the first attempt is around when I expect the workflow should have finished, the last is around when it actually finished!
kubelet logs groaning with things like this:
kubectl events shows:
So to summarise: it seems like the kubelet was very slow to actually delete the pod, but from my point of view Argo should be able to consider the workflow complete once it has sent the instruction to terminate the pod, not once it actually detects that the pod was removed.
For other workflows whose finish time is well after the step finish time (but with no kubelet logs related to eviction), I notice the workflow controller sends terminateContainers first and then 30 seconds later sends killContainers. I'm thinking it should always send the most forceful option with a grace period of 0. I see #4940 added terminationGracePeriodSeconds, but it requires podSpecPatch, which will slow things down.
Here is a more common case: nothing odd in the kubelet, and the main container logs show it actually finished doing work (i.e. my pod running Python code) at 13:17:57 (this also aligns with the END TIME of the pod step in the Argo UI).
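For reference, a sketch of the #4940-style workaround mentioned above, i.e. forcing a zero grace period through podSpecPatch (image, names, and namespace are placeholders; untested):

```bash
kubectl -n argo create -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: short-grace-
spec:
  entrypoint: main
  # Patch the generated pod spec so containers are killed immediately on terminate.
  podSpecPatch: |
    terminationGracePeriodSeconds: 0
  templates:
    - name: main
      container:
        image: python:3.12
        command: [python, -c, "print('done')"]
EOF
```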
@tooptoop4 It looks different from #13466.
I have two suggestions: you can check whether the informer's list-and-watch has failed (timeout), and you can check whether the apiserver load is high.
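Two quick ways to run those checks (namespace/deployment names assumed to be the defaults):

```bash
# 1. Look for failed or timed-out list/watch calls from the controller's informers.
kubectl -n argo logs deploy/workflow-controller --since=1h \
  | grep -iE 'failed to list|failed to watch|watch.*(closed|error|timeout)'

# 2. Spot-check apiserver request latency from its metrics endpoint.
kubectl get --raw /metrics \
  | grep -E '^apiserver_request_duration_seconds_(sum|count)' | head -n 20
```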
@shuangkun I didn't realize there is another informer in
UPDATE: no luck, unfortunately.
Does this only happen at large scale?
No, it happens every day, even when fewer than 100 workflows (with a single node each) are running. Have you looked at your argo_archived_workflows table? Perhaps you can also see a gap of many seconds between node end time and workflow end time.
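A sketch of a query for that comparison against the archive database (Postgres; the connection string, column names, and the status layout inside the workflow JSON are assumptions based on the default archive schema):

```bash
psql "$ARCHIVE_DSN" <<'SQL'
-- Gap between the last node's finishedAt and the workflow's finishedat.
SELECT name,
       finishedat,
       (SELECT max((value->>'finishedAt')::timestamptz)
          FROM jsonb_each(workflow::jsonb->'status'->'nodes')) AS last_node_finished_at
FROM argo_archived_workflows
ORDER BY finishedat DESC
LIMIT 20;
SQL
```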
It feels like there is a missing call to requeue.
Any idea, @jswxstw?
I initially suspected it might be the reason mentioned by @Joibel, but in v3.4.11 the workflow does not need to wait for WorkflowTaskResults.
Ah, yes... that is true, it isn't about WFTaskResults, but I still think we should have called requeue and didn't. |
Pre-requisites
I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
What happened? What did you expect to happen?
The UI shows the single step/pod completed successfully at 2024-09-27T04:31:36, but then shows a big gap before the overall workflow completed successfully at 2024-09-27T04:36:09.
Version(s)
3.4.11
Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
n/a
Logs from the workflow controller
Logs from your workflow's wait container