
delay between step/pod completion and workflow completion #13671

Open
2 of 4 tasks
tooptoop4 opened this issue Sep 27, 2024 · 16 comments
Labels
area/controller, solution/outdated, type/bug

Comments

@tooptoop4
Contributor

tooptoop4 commented Sep 27, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

The UI shows the single step/pod completed successfully at 2024-09-27T04:31:36,
but the UI then shows a large gap before the overall workflow is marked completed at 2024-09-27T04:36:09 (roughly 4.5 minutes later).

Version(s)

3.4.11

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

n/a

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

time=\"2024-09-27T04:29:15.202Z\" level=info msg=\"Processing workflow\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:29:15.234Z\" level=info msg=\"Updated phase  -> Running\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:29:15.236Z\" level=info msg=\"Retry node aredactwf initialized Running\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:29:15.237Z\" level=info msg=\"Pod node aredactwf-3720301757 initialized Pending\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:29:15.324Z\" level=info msg=\"Created pod: aredactwf(0) (aredactwf-mystep-3720301757)\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:29:15.324Z\" level=info msg=\"TaskSet Reconciliation\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:29:15.324Z\" level=info msg=reconcileAgentPod namespace=myns workflow=aredactwf
time=\"2024-09-27T04:29:15.370Z\" level=info msg=\"Workflow update successful\" namespace=myns phase=Running resourceVersion=426512752 workflow=aredactwf
time=\"2024-09-27T04:29:25.202Z\" level=info msg=\"Processing workflow\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:29:25.205Z\" level=info msg=\"Task-result reconciliation\" namespace=myns numObjs=0 workflow=aredactwf
time=\"2024-09-27T04:29:25.205Z\" level=info msg=\"node changed\" namespace=myns new.message= new.phase=Running new.progress=0/1 nodeID=aredactwf-3720301757 old.message= old.phase=Pending old.progress=0/1 workflow=aredactwf
time=\"2024-09-27T04:29:25.207Z\" level=info msg=\"node aredactwf message: retryStrategy.expression evaluated to false\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:29:25.207Z\" level=info msg=\"TaskSet Reconciliation\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:29:25.207Z\" level=info msg=reconcileAgentPod namespace=myns workflow=aredactwf
time=\"2024-09-27T04:29:25.230Z\" level=info msg=\"Workflow update successful\" namespace=myns phase=Running resourceVersion=426512956 workflow=aredactwf
time=\"2024-09-27T04:29:35.230Z\" level=info msg=\"Processing workflow\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:29:35.233Z\" level=info msg=\"Task-result reconciliation\" namespace=myns numObjs=0 workflow=aredactwf
time=\"2024-09-27T04:29:35.233Z\" level=info msg=\"node unchanged\" namespace=myns nodeID=aredactwf-3720301757 workflow=aredactwf
time=\"2024-09-27T04:29:35.235Z\" level=info msg=\"TaskSet Reconciliation\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:29:35.235Z\" level=info msg=reconcileAgentPod namespace=myns workflow=aredactwf
time=\"2024-09-27T04:31:39.661Z\" level=info msg=\"Processing workflow\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:31:39.663Z\" level=info msg=\"Task-result reconciliation\" namespace=myns numObjs=1 workflow=aredactwf
time=\"2024-09-27T04:31:39.663Z\" level=info msg=\"task-result changed\" namespace=myns nodeID=aredactwf-3720301757 workflow=aredactwf
time=\"2024-09-27T04:31:39.663Z\" level=info msg=\"node unchanged\" namespace=myns nodeID=aredactwf-3720301757 workflow=aredactwf
time=\"2024-09-27T04:31:39.665Z\" level=info msg=\"TaskSet Reconciliation\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:31:39.665Z\" level=info msg=reconcileAgentPod namespace=myns workflow=aredactwf
time=\"2024-09-27T04:31:39.682Z\" level=info msg=\"Workflow update successful\" namespace=myns phase=Running resourceVersion=426514616 workflow=aredactwf
time=\"2024-09-27T04:31:49.683Z\" level=info msg=\"Processing workflow\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:31:49.686Z\" level=info msg=\"Task-result reconciliation\" namespace=myns numObjs=1 workflow=aredactwf
time=\"2024-09-27T04:31:49.686Z\" level=info msg=\"task-result changed\" namespace=myns nodeID=aredactwf-3720301757 workflow=aredactwf
time=\"2024-09-27T04:31:49.686Z\" level=info msg=\"node unchanged\" namespace=myns nodeID=aredactwf-3720301757 workflow=aredactwf
time=\"2024-09-27T04:31:49.688Z\" level=info msg=\"TaskSet Reconciliation\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:31:49.688Z\" level=info msg=reconcileAgentPod namespace=myns workflow=aredactwf
time=\"2024-09-27T04:31:49.707Z\" level=info msg=\"Workflow update successful\" namespace=myns phase=Running resourceVersion=426514616 workflow=aredactwf
time=\"2024-09-27T04:36:09.656Z\" level=info msg=\"Processing workflow\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:36:09.658Z\" level=info msg=\"Task-result reconciliation\" namespace=myns numObjs=1 workflow=aredactwf
time=\"2024-09-27T04:36:09.658Z\" level=info msg=\"task-result changed\" namespace=myns nodeID=aredactwf-3720301757 workflow=aredactwf
time=\"2024-09-27T04:36:09.658Z\" level=info msg=\"node changed\" namespace=myns new.message= new.phase=Succeeded new.progress=0/1 nodeID=aredactwf-3720301757 old.message= old.phase=Running old.progress=0/1 workflow=aredactwf
time=\"2024-09-27T04:36:09.660Z\" level=info msg=\"node aredactwf phase Running -> Succeeded\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:36:09.660Z\" level=info msg=\"node aredactwf finished: 2024-09-27 04:36:09.66063598 +0000 UTC\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:36:09.660Z\" level=info msg=\"TaskSet Reconciliation\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:36:09.660Z\" level=info msg=reconcileAgentPod namespace=myns workflow=aredactwf
time=\"2024-09-27T04:36:09.660Z\" level=info msg=\"Running OnExit handler: exit-handler\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:36:09.661Z\" level=info msg=\"Retry node aredactwf-1656388105 initialized Running\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:36:09.661Z\" level=info msg=\"Steps node aredactwf-3191187056 initialized Running\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:36:09.661Z\" level=info msg=\"StepGroup node aredactwf-202582218 initialized Running\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:36:09.661Z\" level=info msg=\"Skipping aredactwf.onExit(0)[0].notifyError: when 'Succeeded != Succeeded' evaluated false\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:36:09.661Z\" level=info msg=\"Skipped node aredactwf-2734285361 initialized Skipped (message: when 'Succeeded != Succeeded' evaluated false)\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:36:09.661Z\" level=info msg=\"Step group node aredactwf-202582218 successful\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:36:09.661Z\" level=info msg=\"node aredactwf-202582218 phase Running -> Succeeded\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:36:09.661Z\" level=info msg=\"node aredactwf-202582218 finished: 2024-09-27 04:36:09.661610081 +0000 UTC\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:36:09.661Z\" level=info msg=\"Outbound nodes of aredactwf-2734285361 is [aredactwf-2734285361]\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:36:09.661Z\" level=info msg=\"Outbound nodes of aredactwf-3191187056 is [aredactwf-2734285361]\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:36:09.661Z\" level=info msg=\"node aredactwf-3191187056 phase Running -> Succeeded\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:36:09.661Z\" level=info msg=\"node aredactwf-3191187056 finished: 2024-09-27 04:36:09.661709572 +0000 UTC\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:36:09.662Z\" level=info msg=\"node aredactwf-1656388105 phase Running -> Succeeded\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:36:09.662Z\" level=info msg=\"node aredactwf-1656388105 message: retryStrategy.expression evaluated to false\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:36:09.662Z\" level=info msg=\"node aredactwf-1656388105 finished: 2024-09-27 04:36:09.662286778 +0000 UTC\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:36:09.662Z\" level=info msg=\"Updated phase Running -> Succeeded\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:36:09.662Z\" level=info msg=\"Marking workflow completed\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:36:09.662Z\" level=info msg=\"Marking workflow as pending archiving\" namespace=myns workflow=aredactwf
time=\"2024-09-27T04:36:09.668Z\" level=info msg=\"cleaning up pod\" action=deletePod key=myns/aredactwf-1340600742-agent/deletePod
time=\"2024-09-27T04:36:09.720Z\" level=info msg=\"Workflow update successful\" namespace=myns phase=Succeeded resourceVersion=426517669 workflow=aredactwf
time=\"2024-09-27T04:36:10.372Z\" level=info msg=\"archiving workflow\" namespace=myns uid=d4ea0321-a659-43d7-8ed9-eb839650b452 workflow=aredactwf
time=\"2024-09-27T04:36:10.466Z\" level=info msg=\"Queueing Succeeded workflow myns/aredactwf for delete in 1m4s due to TTL\"
time=\"2024-09-27T04:36:14.745Z\" level=info msg=\"cleaning up pod\" action=deletePod key=myns/aredactwf-mystep-3720301757/deletePod

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

\"2024-09-27T04:29:18.477Z\" level=info msg=\"Starting Workflow Executor\" version=v3.4.11
\"2024-09-27T04:29:18.482Z\" level=info msg=\"Using executor retry strategy\" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
\"2024-09-27T04:29:18.482Z\" level=info msg=\"monitoring progress disabled\" annotationPatchTickDuration=0s readProgressFileTickDuration=0s
\"2024-09-27T04:29:18.483Z\" level=info msg=\"Starting deadline monitor\"
\"2024-09-27T04:31:29.504Z\" level=info msg=\"Main container completed\" error=\"<nil>\"
\"2024-09-27T04:31:29.504Z\" level=info msg=\"No Script output reference in workflow. Capturing script output ignored\"
\"2024-09-27T04:31:29.504Z\" level=info msg=\"No output parameters\"
\"2024-09-27T04:31:29.504Z\" level=info msg=\"No output artifacts\"
\"2024-09-27T04:31:29.524Z\" level=info msg=\"S3 Save path: /tmp/argo/outputs/logs/main.log, key: argo_wf_logs/2024/09/27/04/29/aredactwf/aredactwf-mystep-3720301757/main.log\"
\"2024-09-27T04:31:29.525Z\" level=info msg=\"Creating minio client using AWS SDK credentials\"
\"2024-09-27T04:31:29.565Z\" level=info msg=\"Saving file to s3\" bucket=redactbuc endpoint=s3.amazonaws.com key=argo_wf_logs/2024/09/27/04/29/aredactwf/aredactwf-mystep-3720301757/main.log path=/tmp/argo/outputs/logs/main.log
\"2024-09-27T04:31:29.641Z\" level=info msg=\"Save artifact\" artifactName=main-logs duration=116.831623ms error=\"<nil>\" key=argo_wf_logs/2024/09/27/04/29/aredactwf/aredactwf-mystep-3720301757/main.log
\"2024-09-27T04:31:29.641Z\" level=info msg=\"not deleting local artifact\" localArtPath=/tmp/argo/outputs/logs/main.log
\"2024-09-27T04:31:29.641Z\" level=info msg=\"Successfully saved file: /tmp/argo/outputs/logs/main.log\"
\"2024-09-27T04:31:29.659Z\" level=info msg=\"Alloc=11298 TotalAlloc=18244 Sys=33917 NumGC=5 Goroutines=11\"
\"2024-09-27T04:31:29.659Z\" level=info msg=\"Deadline monitor stopped\"
@shuangkun
Member

Are there many workflows in the cluster?

@tooptoop4
Contributor Author

@shuangkun
there were 31 running workflows at the time.

Note that the workflow I ran is not a CronWorkflow (just a normal Workflow), and I'm not overriding the defaults for --workflow-workers or any other worker-related args. I noticed the queue depth of cron_wf_queue was 35 at the time the workflow started, but it dropped to 0 a minute later; all other queue depths were constantly 0. Time objects spent waiting in the queue was 13s for cron_wf_queue. argo_workflows_operation_duration_seconds_bucket showed 3s at p95. argo_workflows_queue_adds_count was 257 for cron_wf_queue, 95 for workflow_queue, and 16 for pod_cleanup_queue.
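For reference, a minimal sketch of how these queue and worker metrics can be pulled from the controller, assuming the default metrics configuration (port 9090, path /metrics) and a workflow-controller Deployment in the argo namespace:

# expose the controller metrics endpoint locally (assumes the default port 9090)
kubectl -n argo port-forward deploy/workflow-controller 9090:9090 &
# snapshot the queue depth/adds/latency, busy workers, and operation duration metrics discussed above
curl -s http://localhost:9090/metrics | grep -E 'argo_workflows_(queue_depth|queue_adds_count|queue_latency|workers_busy|operation_duration_seconds)'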

@tooptoop4
Contributor Author

tooptoop4 commented Sep 30, 2024

Increasing --qps, --burst, --pod-cleanup-workers, --cron-workflow-workers, and --workflow-workers did NOT help. One strange thing: pod storage fluctuates up to around 29MB and then down to 13MB, and around the same time a large DAG with 42 pods ran. Also around the same time, the argo_workflows_queue_adds_count metric added 500 in 5 minutes for workflow_queue and went to 75 for pod_cleanup_queue, yet the queue depth never went up, the busy workers never went above 5, and there were never more than 45 workflows across all states in that window!

I'm thinking the workflow informer is incomplete. Is there some setting to say "just look at all workflows every X seconds, even if they are not in the informer"? @shuangkun

@fyp711 do you think my issue could be related to the informer missing things, like you raised in #13466?

@shuangkun
Member

@tooptoop4 As you found in #13690, it is 20 minutes.
Maybe we need to find out where the root problem is. Is your time difference about 5 minutes? Can it be reproduced stably? Is there a simple example?
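For context, 20 minutes matches the controller's default workflow resync period. A sketch of lowering it, as is tried later in this thread, assuming the WORKFLOW_RESYNC_PERIOD environment variable on a workflow-controller Deployment in the argo namespace (the 2m value is only illustrative):

# shorten the interval at which the controller relists all workflows (default 20m)
kubectl -n argo set env deploy/workflow-controller WORKFLOW_RESYNC_PERIOD=2m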

@tooptoop4
Contributor Author

tooptoop4 commented Oct 1, 2024

@shuangkun it can be reproduced stably: keep a constant flow of short (i.e. finishes in ~1 minute) one-step workflows running. For example, at 12:03pm submit 200 workflows; at 12:04pm the prior 200 will have finished, so submit another 200; at 12:05pm submit another 200; and so on, for a throughput of 12,000 per hour (200 per minute). Then compare the finish time of the step with the finish time of the workflow: the difference I have seen is 10s in the best case but around 11 minutes in the worst case.
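A minimal load-generation sketch of that pattern, assuming a namespace myns and using a hypothetical one-step sleep workflow as a stand-in for the real short workflows (submission here is sequential; parallelise as needed to hit 200/minute):

# illustrative only: submit batches of 200 one-step workflows, roughly once per minute
while true; do
  for i in $(seq 1 200); do
    cat <<'EOF' | kubectl -n myns create -f -
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: short-wf-
spec:
  entrypoint: main
  templates:
    - name: main
      container:
        image: alpine:3.19
        command: [sh, -c, "sleep 30"]
EOF
  done
  sleep 60
done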

I wonder if debug logging of what the informer has in its queue would help to investigate.

This sounds like #1416 (comment).

@shuangkun
Member

@tooptoop4 Have you tried the #13690 adjustment to alleviate this problem?

@shuangkun added the area/controller label on Oct 2, 2024
@tooptoop4
Contributor Author

tooptoop4 commented Oct 3, 2024

@shuangkun The bad news is that lowering the workflow and pod resync periods to around 1-2 minutes did not help; I still had a case where the delay between step end time and workflow end time was around 269 seconds. BUT the good news is that I think I've stumbled upon the root cause, read below:

the main container logs show it actually finished its work (i.e. my pod running Python code) at 21:32:50 (this also aligns with the END TIME of the pod step in the Argo UI)
both main and wait containers switch from kube_pod_container_status_running=1/kube_pod_container_status_terminated=0 to kube_pod_container_status_running=0/kube_pod_container_status_terminated=1 at 21:33:00
BUT kube_pod_status_phase shows running until 21:37:00
the workflow finish time in the UI shows as 21:37:19

Workflow-controller logs (containing 'cleaning up pod') for this pod; it tried several times, and the first attempt is around when I expect the workflow should have finished, while the last is around when it actually finished:

{"time":"2024-10-02T21:33:02.331549949Z","stream":"stdout","_p":"F","log":"time=\"2024-10-02T21:33:00.413Z\" level=info msg=\"cleaning up pod\" action=terminateContainers key=myns/mypod/terminateContainers
{"time":"2024-10-02T21:33:27.204261129Z","stream":"stdout","_p":"F","log":"time=\"2024-10-02T21:33:27.202Z\" level=info msg=\"cleaning up pod\" action=terminateContainers key=myns/mypod/terminateContainers
{"time":"2024-10-02T21:33:42.01999347Z","stream":"stdout","_p":"F","log":"time=\"2024-10-02T21:33:30.414Z\" level=info msg=\"cleaning up pod\" action=killContainers key=myns/mypod/killContainers
{"time":"2024-10-02T21:33:57.207610446Z","stream":"stdout","_p":"F","log":"time=\"2024-10-02T21:33:57.205Z\" level=info msg=\"cleaning up pod\" action=terminateContainers key=myns/mypod/terminateContainers
{"time":"2024-10-02T21:34:27.206083759Z","stream":"stdout","_p":"F","log":"time=\"2024-10-02T21:34:27.205Z\" level=info msg=\"cleaning up pod\" action=killContainers key=myns/mypod/killContainers
{"time":"2024-10-02T21:34:27.206088869Z","stream":"stdout","_p":"F","log":"time=\"2024-10-02T21:34:27.205Z\" level=info msg=\"cleaning up pod\" action=terminateContainers key=myns/mypod/terminateContainers
{"time":"2024-10-02T21:34:57.214838677Z","stream":"stdout","_p":"F","log":"time=\"2024-10-02T21:34:57.207Z\" level=info msg=\"cleaning up pod\" action=killContainers key=myns/mypod/killContainers
{"time":"2024-10-02T21:34:57.217688507Z","stream":"stdout","_p":"F","log":"time=\"2024-10-02T21:34:57.217Z\" level=info msg=\"cleaning up pod\" action=terminateContainers key=myns/mypod/terminateContainers
{"time":"2024-10-02T21:35:27.218840968Z","stream":"stdout","_p":"F","log":"time=\"2024-10-02T21:35:27.210Z\" level=info msg=\"cleaning up pod\" action=terminateContainers key=myns/mypod/terminateContainers
{"time":"2024-10-02T21:35:27.21899863Z","stream":"stdout","_p":"F","log":"time=\"2024-10-02T21:35:27.217Z\" level=info msg=\"cleaning up pod\" action=killContainers key=myns/mypod/killContainers
{"time":"2024-10-02T21:35:57.219513886Z","stream":"stdout","_p":"F","log":"time=\"2024-10-02T21:35:57.211Z\" level=info msg=\"cleaning up pod\" action=terminateContainers key=myns/mypod/terminateContainers
{"time":"2024-10-02T21:36:27.215534417Z","stream":"stdout","_p":"F","log":"time=\"2024-10-02T21:36:27.211Z\" level=info msg=\"cleaning up pod\" action=killContainers key=myns/mypod/killContainers
{"time":"2024-10-02T21:36:27.215548187Z","stream":"stdout","_p":"F","log":"time=\"2024-10-02T21:36:27.213Z\" level=info msg=\"cleaning up pod\" action=terminateContainers key=myns/mypod/terminateContainers
{"time":"2024-10-02T21:36:57.217225673Z","stream":"stdout","_p":"F","log":"time=\"2024-10-02T21:36:57.213Z\" level=info msg=\"cleaning up pod\" action=killContainers key=myns/mypod/killContainers
{"time":"2024-10-02T21:36:57.217234973Z","stream":"stdout","_p":"F","log":"time=\"2024-10-02T21:36:57.214Z\" level=info msg=\"cleaning up pod\" action=terminateContainers key=myns/mypod/terminateContainers
{"time":"2024-10-02T21:37:26.49104221Z","stream":"stdout","_p":"F","log":"time=\"2024-10-02T21:37:24.478Z\" level=info msg=\"cleaning up pod\" action=deletePod key=myns/mypod/deletePod
{"time":"2024-10-02T21:37:27.240338159Z","stream":"stdout","_p":"F","log":"time=\"2024-10-02T21:37:27.215Z\" level=info msg=\"cleaning up pod\" action=killContainers key=myns/mypod/killContainers

The kubelet logs are groaning with entries like these:

21:34:29.809574   37233 eviction_manager.go:395] "Eviction manager: pods ranked for eviction" pods=[...there are several here including the one im interested in...i redacted...]
21:34:39.810514   37233 eviction_manager.go:614] "Eviction manager: pod failed to evict" err="timeout waiting to kill pod" pod="myns/mypod"
21:34:39.810611   37233 eviction_manager.go:205] "Eviction manager: pods evicted, waiting for pod to be cleaned up" pods=["myns/mypod"]
21:36:30.289954   37233 eviction_manager.go:439] "Eviction manager: timed out waiting for pods to be cleaned up" pods=["myns/mypod"]
21:36:30.489498   37233 eviction_manager.go:395] "Eviction manager: pods ranked for eviction" pods=[...there are several here including the one im interested in...i redacted...]
21:36:40.490431   37233 eviction_manager.go:614] "Eviction manager: pod failed to evict" err="timeout waiting to kill pod" pod="myns/mypod"

kubectl events shows:

 8m34s       Warning   Evicted                 pod/mypod                           kubelet, myhost   The node was low on resource: ephemeral-storage. Threshold quantity: 19295517848, available: 17739364Ki.
 6m23s       Warning   ExceededGracePeriod     pod/mypod                           kubelet, myhost   Container runtime did not kill the pod within specified grace period.                                   
 7m53s       Warning   Evicted                 pod/mypod                           kubelet, myhost   The node was low on resource: ephemeral-storage. Threshold quantity: 12863678246, available: 5765816Ki. 
 7m13s       Warning   Evicted                 pod/mypod                           kubelet, myhost   The node was low on resource: ephemeral-storage. Threshold quantity: 12863678246, available: 1525016Ki. 
 6m33s       Warning   Evicted                 pod/mypod                           kubelet, myhost   The node was low on resource: ephemeral-storage. Threshold quantity: 12863678246, available: 3357484Ki. 

So to summarise: it seems like the kubelet was very slow to actually delete the pod, but from my point of view Argo should be able to consider the workflow complete once it has sent the instruction to terminate the pod, not only when it detects that the pod was actually removed.

For other workflows whose finish time is well after the step finish time (but with no kubelet logs related to eviction), I notice the workflow controller sends terminateContainers first and then, 30 seconds later, sends killContainers. I'm thinking it should always send the most forceful option with a grace period of 0.

I see #4940 added terminationGracePeriodSeconds, but it requires podSpecPatch, which will slow things down.
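For reference, a rough sketch of what that podSpecPatch approach looks like on a workflow spec (the one-step template, the 5-second value, and the myns namespace are purely illustrative):

cat <<'EOF' | kubectl -n myns create -f -
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: short-grace-
spec:
  entrypoint: main
  # strategic merge patch applied to every pod created by this workflow
  podSpecPatch: |
    terminationGracePeriodSeconds: 5
  templates:
    - name: main
      container:
        image: alpine:3.19
        command: [sh, -c, "sleep 30"]
EOF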

Here is a more common case (nothing odd in the kubelet): the main container logs show it actually finished its work (i.e. my pod running Python code) at 13:17:57 (this also aligns with the END TIME of the pod step in the Argo UI)
both main and wait containers switch from kube_pod_container_status_running=1/kube_pod_container_status_terminated=0 to kube_pod_container_status_running=0/kube_pod_container_status_terminated=1 at 13:23:30 (metrics are scraped only every 30s, so somewhere between 13:23:00 and 13:23:30)
BUT kube_pod_status_phase shows running until 13:23:30 (again, somewhere between 13:23:00 and 13:23:30)
the workflow finish time in the UI shows as 13:23:26

{"time":"2024-10-03T13:17:36.413393037Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:17:36.303Z\" level=info msg=\"Processing workflow\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:17:36.413403677Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:17:36.316Z\" level=info msg=\"Updated phase  -> Running\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:17:36.413409447Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:17:36.319Z\" level=info msg=\"Retry node anotherwf initialized Running\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:17:36.413415557Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:17:36.320Z\" level=info msg=\"Pod node anotherwf-200756155 initialized Pending\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:17:36.413421207Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:17:36.361Z\" level=info msg=\"Created pod: anotherwf(0) (anotherwf-mypod)\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:17:36.413426398Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:17:36.361Z\" level=info msg=\"TaskSet Reconciliation\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:17:36.413431688Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:17:36.361Z\" level=info msg=reconcileAgentPod namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:17:36.413441938Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:17:36.389Z\" level=info msg=\"Workflow update successful\" namespace=myns phase=Running resourceVersion=434173271 workflow=anotherwf
{"time":"2024-10-03T13:17:46.87447036Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:17:46.303Z\" level=info msg=\"Processing workflow\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:17:46.87447558Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:17:46.306Z\" level=info msg=\"Task-result reconciliation\" namespace=myns numObjs=0 workflow=anotherwf
{"time":"2024-10-03T13:17:46.8744898Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:17:46.306Z\" level=info msg=\"node changed\" namespace=myns new.message= new.phase=Running new.progress=0/1 nodeID=anotherwf-200756155 old.message= old.phase=Pending old.progress=0/1 workflow=anotherwf
{"time":"2024-10-03T13:17:46.87449513Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:17:46.308Z\" level=info msg=\"node anotherwf message: retryStrategy.expression evaluated to false\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:17:46.87449961Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:17:46.308Z\" level=info msg=\"TaskSet Reconciliation\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:17:46.87450394Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:17:46.308Z\" level=info msg=reconcileAgentPod namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:17:46.87451275Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:17:46.325Z\" level=info msg=\"Workflow update successful\" namespace=myns phase=Running resourceVersion=434173659 workflow=anotherwf
{"time":"2024-10-03T13:18:07.125821521Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:18:07.106Z\" level=info msg=\"Processing workflow\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:18:07.125826271Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:18:07.108Z\" level=info msg=\"Task-result reconciliation\" namespace=myns numObjs=1 workflow=anotherwf
{"time":"2024-10-03T13:18:07.125831851Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:18:07.109Z\" level=info msg=\"task-result changed\" namespace=myns nodeID=anotherwf-200756155 workflow=anotherwf
{"time":"2024-10-03T13:18:07.125845181Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:18:07.109Z\" level=info msg=\"node unchanged\" namespace=myns nodeID=anotherwf-200756155 workflow=anotherwf
{"time":"2024-10-03T13:18:07.125850011Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:18:07.111Z\" level=info msg=\"TaskSet Reconciliation\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:18:07.125859511Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:18:07.111Z\" level=info msg=reconcileAgentPod namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:18:07.324836058Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:18:07.130Z\" level=info msg=\"Workflow update successful\" namespace=myns phase=Running resourceVersion=434174117 workflow=anotherwf
{"time":"2024-10-03T13:23:09.961657465Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:07.107Z\" level=info msg=\"Processing workflow\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:23:09.961664296Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:07.110Z\" level=info msg=\"Task-result reconciliation\" namespace=myns numObjs=1 workflow=anotherwf
{"time":"2024-10-03T13:23:09.961670926Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:07.110Z\" level=info msg=\"node unchanged\" namespace=myns nodeID=anotherwf-200756155 workflow=anotherwf
{"time":"2024-10-03T13:23:09.961677356Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:07.112Z\" level=info msg=\"TaskSet Reconciliation\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:23:09.961683526Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:07.112Z\" level=info msg=reconcileAgentPod namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:23:26.83742064Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:26.831Z\" level=info msg=\"Processing workflow\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:23:26.83742957Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:26.833Z\" level=info msg=\"Task-result reconciliation\" namespace=myns numObjs=1 workflow=anotherwf
{"time":"2024-10-03T13:23:26.83743638Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:26.833Z\" level=info msg=\"node changed\" namespace=myns new.message= new.phase=Succeeded new.progress=0/1 nodeID=anotherwf-200756155 old.message= old.phase=Running old.progress=0/1 workflow=anotherwf
{"time":"2024-10-03T13:23:26.83744259Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:26.835Z\" level=info msg=\"node anotherwf phase Running -> Succeeded\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:23:26.83744849Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:26.835Z\" level=info msg=\"node anotherwf finished: 2024-10-03 13:23:26.835968997 +0000 UTC\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:23:26.83745483Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:26.836Z\" level=info msg=\"TaskSet Reconciliation\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:23:26.83746717Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:26.836Z\" level=info msg=reconcileAgentPod namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:23:26.837473881Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:26.836Z\" level=info msg=\"Running OnExit handler: exit-handler\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:23:26.837480881Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:26.836Z\" level=info msg=\"Retry node anotherwf-1441074331 initialized Running\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:23:26.837497401Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:26.836Z\" level=info msg=\"Steps node anotherwf-205821078 initialized Running\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:23:26.837503671Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:26.836Z\" level=info msg=\"StepGroup node anotherwf-349863432 initialized Running\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:23:26.837510431Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:26.836Z\" level=info msg=\"Skipping anotherwf.onExit(0)[0].notifyError: when 'Succeeded != Succeeded' evaluated false\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:23:26.837517201Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:26.836Z\" level=info msg=\"Skipped node anotherwf-1196496943 initialized Skipped (message: when 'Succeeded != Succeeded' evaluated false)\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:23:26.837523591Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:26.836Z\" level=info msg=\"Step group node anotherwf-349863432 successful\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:23:26.837529761Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:26.837Z\" level=info msg=\"node anotherwf-349863432 phase Running -> Succeeded\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:23:26.837535441Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:26.837Z\" level=info msg=\"node anotherwf-349863432 finished: 2024-10-03 13:23:26.837013173 +0000 UTC\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:23:26.837541272Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:26.837Z\" level=info msg=\"Outbound nodes of anotherwf-1196496943 is [anotherwf-1196496943]\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:23:26.837547332Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:26.837Z\" level=info msg=\"Outbound nodes of anotherwf-205821078 is [anotherwf-1196496943]\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:23:26.87714857Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:26.837Z\" level=info msg=\"node anotherwf-205821078 phase Running -> Succeeded\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:23:26.877183101Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:26.837Z\" level=info msg=\"node anotherwf-205821078 finished: 2024-10-03 13:23:26.837158986 +0000 UTC\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:23:26.877188861Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:26.837Z\" level=info msg=\"node anotherwf-1441074331 phase Running -> Succeeded\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:23:26.877193781Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:26.837Z\" level=info msg=\"node anotherwf-1441074331 message: retryStrategy.expression evaluated to false\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:23:26.877198471Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:26.837Z\" level=info msg=\"node anotherwf-1441074331 finished: 2024-10-03 13:23:26.837858246 +0000 UTC\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:23:26.877203621Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:26.837Z\" level=info msg=\"Updated phase Running -> Succeeded\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:23:26.877215101Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:26.837Z\" level=info msg=\"Marking workflow completed\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:23:26.877221112Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:26.837Z\" level=info msg=\"Marking workflow as pending archiving\" namespace=myns workflow=anotherwf
{"time":"2024-10-03T13:23:26.877234542Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:26.854Z\" level=info msg=\"Workflow update successful\" namespace=myns phase=Succeeded resourceVersion=434178326 workflow=anotherwf
{"time":"2024-10-03T13:23:27.154668921Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:27.154Z\" level=info msg=\"archiving workflow\" namespace=myns uid=7972a1bd-38bf-43b1-9667-8c0d87204c0e workflow=anotherwf
{"time":"2024-10-03T13:23:27.242124158Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:27.224Z\" level=info msg=\"Queueing Succeeded workflow myns/anotherwf for delete in 1m4s due to TTL\"
{"time":"2024-10-03T13:23:32.001046743Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:23:31.881Z\" level=info msg=\"cleaning up pod\" action=deletePod key=myns/anotherwf-mypod/deletePod
{"time":"2024-10-03T13:24:32.433995643Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:24:32.000Z\" level=info msg=\"Deleting garbage collected workflow 'myns/anotherwf'\"
{"time":"2024-10-03T13:24:32.434057424Z","stream":"stdout","_p":"F","log":"time=\"2024-10-03T13:24:32.052Z\" level=info msg=\"Successfully deleted 'myns/anotherwf'\"

@fyp711
Contributor

fyp711 commented Oct 8, 2024

@fyp711 do u think my issue could be related to informer missing things like u raised with #13466 ?

@tooptoop4 It looks different from #13466.

@fyp711
Contributor

fyp711 commented Oct 8, 2024

I have two suggestions: check whether the informer's ListAndWatch has failed (timed out), and check whether the apiserver load is high.
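A sketch of those two checks, assuming a workflow-controller Deployment in the argo namespace (the grep patterns are only a starting point, not an exhaustive list of watch-failure messages):

# look for informer list/watch failures or timeouts in the controller logs
kubectl -n argo logs deploy/workflow-controller | grep -iE 'failed to (list|watch)|watch.*(closed|timeout|too old)'
# rough check of apiserver load via its request metrics (requires permission to read /metrics)
kubectl get --raw /metrics | grep -E '^apiserver_request_duration_seconds_(sum|count)' | head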

@tooptoop4
Contributor Author

tooptoop4 commented Oct 9, 2024

@shuangkun didn't realize there is another informer in

Will tweak this and report back.

UPDATE: no luck, unfortunately.

@shuangkun
Member

Does this only happen at large scale?

@tooptoop4
Contributor Author

No, it happens every day, even when fewer than 100 workflows (each with a single node) are running. Have you looked at your argo_archived_workflows table? Perhaps you can also see a gap of many seconds between the node end time and the workflow end time.
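A rough sketch of that comparison, assuming a PostgreSQL archive with the default argo_archived_workflows table and that node finish times live in the archived workflow JSON under status.nodes; ARCHIVE_DSN is a placeholder connection string, and the column/field names are my assumptions and may need adjusting for your schema version:

# compare the workflow finish time with the latest node finish time for recent workflows
psql "$ARCHIVE_DSN" -c "
SELECT w.name,
       w.finishedat                                AS wf_finished,
       max((n.value->>'finishedAt')::timestamptz)  AS last_node_finished
FROM argo_archived_workflows w
CROSS JOIN LATERAL jsonb_each(w.workflow::jsonb->'status'->'nodes') AS n
GROUP BY w.name, w.finishedat
ORDER BY w.finishedat DESC
LIMIT 20;"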

@Joibel
Member

Joibel commented Oct 28, 2024

It feels like there is a missing call to woc.requeue() in this scenario, so we're having to wait for a relist...? Perhaps in the scenario that all Pods are complete, but all workflowTaskResults are not we don't requeue?

@tooptoop4
Contributor Author

It feels like there is a missing call to woc.requeue() in this scenario, so we're having to wait for a relist...? Perhaps in the scenario that all Pods are complete, but all workflowTaskResults are not we don't requeue?

any idea @jswxstw

@jswxstw
Member

jswxstw commented Nov 1, 2024

I initially suspected it might be the reason mentioned by @Joibel, but in v3.4.11, the workflow does not need to wait for WorkflowTaskResults to complete.
There should be other situations causing the controller to wait for a relist, but I can't think of any possible reasons for now.

@Joibel
Member

Joibel commented Nov 1, 2024

Ah, yes... that is true, it isn't about WFTaskResults, but I still think we should have called requeue and didn't.
