v3.5.10: Workflow stuck `Running` after pod runs 2min and then is deleted not gracefully #13533

jswxstw · 2024-08-30T08:10:13Z

Pre-requisites

I have double-checked my configuration
I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
I have searched existing issues and could not find a match for this bug
I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

#13454 mark the node as failed after a timeout and mark the workflowtaskresult as completed only when the pod is absent and the node has not been completed in taskResultReconciliation.
If we want to check if pod is absent after a timeout, why use <= here?

argo-workflows/workflow/controller/taskresult.go

Line 58 in fed83ca

    
           return time.Since(node.StartedAt.Time) <= envutil.LookupEnvDurationOr("POD_ABSENT_TIMEOUT", 2*time.Minute)

I am very puzzled by this PR, see #13373 (comment).

I think this PR has two problems:

problem1: If pod is deleted in 2min(POD_ABSENT_TIMEOUT), its task result will be marked as completed immediately. However, we cannot confirm that the pod did not exit gracefully and its task result may not have been observed by the controller yet.
problem2: If pod is deleted after 2min(POD_ABSENT_TIMEOUT), node will be marked as Error with message pod deleted and its task result will always be incomplete if pod did not exit gracefully.

I reproduced the problem2 as below:

rewrite the executor FinalizeOutput logic to facilitate the simulation of abnormal exit midway

// FinalizeOutput adds a label or annotation to denote that outputs have completed reporting.
func (we *WorkflowExecutor) FinalizeOutput(ctx context.Context) {
	log.Info("FinalizeOutput start...")
	// sleep 5s here, avoid patching task result when exit not gracefully.
        time.Sleep(5 * time.Second)
	err := retryutil.OnError(wait.Backoff{
		Duration: time.Second,
		Factor:   2,
		Jitter:   0.1,
		Steps:    5,
		Cap:      30 * time.Second,
	}, errorsutil.IsTransientErr, func() error {
		err := we.patchTaskResultLabels(ctx, map[string]string{
			common.LabelKeyReportOutputsCompleted: "true",
		})
		if apierr.IsForbidden(err) || apierr.IsNotFound(err) {
			log.WithError(err).Warn("failed to patch task result, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/latest/workflow-rbac/")
			// Only added as a backup in case LabelKeyReportOutputsCompleted could not be set
			err = we.AddAnnotation(ctx, common.AnnotationKeyReportOutputsCompleted, "true")
		}
		return err
	})
	if err != nil {
		we.AddError(err)
	}
	log.Info("FinalizeOutput end...")
}

submit the workflow demo, and delete the pod after it runs 2min.

# kubectl delete pod sleep-example-jr4lc-sleep-3885742982 --force        
# argo get sleep-example-jr4lc
Name:                sleep-example-jr4lc
Namespace:           argo
ServiceAccount:      unset (will run with the default ServiceAccount)
Status:              Running
Conditions:          
 PodRunning          False
Created:             Fri Aug 30 15:38:34 +0800 (29 minutes ago)
Started:             Fri Aug 30 15:38:34 +0800 (29 minutes ago)
Duration:            29 minutes 43 seconds
Progress:            0/1

STEP                    TEMPLATE  PODNAME                               DURATION  MESSAGE
 ✖ sleep-example-jr4lc  main                                                      child 'sleep-example-jr4lc-3885742982' failed  
 └───⚠ sleep            sleep     sleep-example-jr4lc-sleep-3885742982  2m        pod deleted

workflow spec:

nodes:
    sleep-example-jr4lc:
      children:
      - sleep-example-jr4lc-2587519231
      displayName: sleep-example-jr4lc
      finishedAt: "2024-08-30T07:41:15Z"
      id: sleep-example-jr4lc
      message: child 'sleep-example-jr4lc-3885742982' failed
      name: sleep-example-jr4lc
      outboundNodes:
      - sleep-example-jr4lc-3885742982
      phase: Failed
      progress: 0/1
      startedAt: "2024-08-30T07:38:34Z"
      templateName: main
      templateScope: local/sleep-example-jr4lc
      type: Steps
    sleep-example-jr4lc-2587519231:
      boundaryID: sleep-example-jr4lc
      children:
      - sleep-example-jr4lc-3885742982
      displayName: '[0]'
      finishedAt: "2024-08-30T07:41:15Z"
      id: sleep-example-jr4lc-2587519231
      message: child 'sleep-example-jr4lc-3885742982' failed
      name: sleep-example-jr4lc[0]
      nodeFlag: {}
      phase: Failed
      progress: 0/1
      startedAt: "2024-08-30T07:38:34Z"
      templateScope: local/sleep-example-jr4lc
      type: StepGroup
    sleep-example-jr4lc-3885742982:
      boundaryID: sleep-example-jr4lc
      displayName: sleep
      finishedAt: "2024-08-30T07:41:15Z"
      hostNodeName: k3d-k3s-default-server-0
      id: sleep-example-jr4lc-3885742982
      message: pod deleted
      name: sleep-example-jr4lc[0].sleep
      outputs:
        artifacts:
        - name: main-logs
          s3:
            key: sleep-example-jr4lc/sleep-example-jr4lc-sleep-3885742982/main.log
      phase: Error
      progress: 0/1
      startedAt: "2024-08-30T07:38:34Z"
      templateName: sleep
      templateScope: local/sleep-example-jr4lc
      type: Pod
  phase: Running
  progress: 0/1
  startedAt: "2024-08-30T07:38:34Z"
  taskResultsCompletionStatus:
    sleep-example-jr4lc-3885742982: false

Version(s)

latest

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: sleep-example-
spec:
  entrypoint: main
  templates:
  - name: main
    steps:
    - - name: sleep
        template: sleep

  - name: sleep
    podSpecPatch: |
      terminationGracePeriodSeconds: 0
    container:
      image: alpine:latest
      command: [sh, -c, sleep 3600]

Logs from the workflow controller

INFO[2024-08-30T15:06:07.597Z] Processing workflow                           Phase= ResourceVersion=2977313 namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:07.602Z] resolved artifact repository                  artifactRepositoryRef="argo/#"
INFO[2024-08-30T15:06:07.602Z] Task-result reconciliation                    namespace=argo numObjs=0 workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:21.466Z] Updated phase  -> Running                     namespace=argo workflow=sleep-example-f96kd
WARN[2024-08-30T15:06:21.466Z] Node was nil, will be initialized as type Skipped  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:21.466Z] was unable to obtain node for , letting display name to be nodeName  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:21.466Z] Steps node sleep-example-f96kd initialized Running  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:21.466Z] StepGroup node sleep-example-f96kd-3280373986 initialized Running  namespace=argo workflow=sleep-example-f96kd
WARN[2024-08-30T15:06:21.467Z] Node was nil, will be initialized as type Skipped  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:21.467Z] Pod node sleep-example-f96kd-165151271 initialized Pending  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:21.481Z] Created pod: sleep-example-f96kd[0].sleep (sleep-example-f96kd-sleep-165151271)  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:21.481Z] Workflow step group node sleep-example-f96kd-3280373986 not yet completed  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:21.482Z] TaskSet Reconciliation                        namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:21.482Z] reconcileAgentPod                             namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:21.482Z] Workflow to be dehydrated                     Workflow Size=1442
INFO[2024-08-30T15:06:21.491Z] Workflow update successful                    namespace=argo phase=Running resourceVersion=2977322 workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:21.503Z] Processing workflow                           Phase=Running ResourceVersion=2977322 namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:21.504Z] Task-result reconciliation                    namespace=argo numObjs=0 workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:21.504Z] node changed                                  namespace=argo new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=sleep-example-f96kd-165151271 old.message= old.phase=Pending old.progress=0/1 workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:21.505Z] Workflow step group node sleep-example-f96kd-3280373986 not yet completed  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:21.505Z] TaskSet Reconciliation                        namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:21.505Z] reconcileAgentPod                             namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:21.505Z] Workflow to be dehydrated                     Workflow Size=1762
INFO[2024-08-30T15:06:21.512Z] Workflow update successful                    namespace=argo phase=Running resourceVersion=2977327 workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:22.484Z] Processing workflow                           Phase=Running ResourceVersion=2977327 namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:22.485Z] Task-result reconciliation                    namespace=argo numObjs=0 workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:22.485Z] node unchanged                                namespace=argo nodeID=sleep-example-f96kd-165151271 workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:22.486Z] Workflow step group node sleep-example-f96kd-3280373986 not yet completed  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:22.486Z] TaskSet Reconciliation                        namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:22.486Z] reconcileAgentPod                             namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:23.653Z] Processing workflow                           Phase=Running ResourceVersion=2977327 namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:23.654Z] Task-result reconciliation                    namespace=argo numObjs=1 workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:23.654Z] node unchanged                                namespace=argo nodeID=sleep-example-f96kd-165151271 workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:23.654Z] Workflow step group node sleep-example-f96kd-3280373986 not yet completed  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:23.654Z] TaskSet Reconciliation                        namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:23.654Z] reconcileAgentPod                             namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:28.665Z] Processing workflow                           Phase=Running ResourceVersion=2977327 namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:28.666Z] Task-result reconciliation                    namespace=argo numObjs=1 workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:28.666Z] node changed                                  namespace=argo new.message= new.phase=Running new.progress=0/1 nodeID=sleep-example-f96kd-165151271 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:28.667Z] Workflow step group node sleep-example-f96kd-3280373986 not yet completed  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:28.667Z] TaskSet Reconciliation                        namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:28.667Z] reconcileAgentPod                             namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:28.667Z] Workflow to be dehydrated                     Workflow Size=1782
INFO[2024-08-30T15:06:28.673Z] Workflow update successful                    namespace=argo phase=Running resourceVersion=2977343 workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.528Z] Processing workflow                           Phase=Running ResourceVersion=2977343 namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.529Z] Task-result reconciliation                    namespace=argo numObjs=1 workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.529Z] Determined controller should timeout for sleep-example-f96kd-165151271  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.529Z] node sleep-example-f96kd-165151271 phase Running -> Failed  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.529Z] node sleep-example-f96kd-165151271 message: pod was absent  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.529Z] node sleep-example-f96kd-165151271 finished: 2024-08-30 07:07:11.529418633 +0000 UTC  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.529Z] task-result changed                           namespace=argo nodeID=sleep-example-f96kd-165151271 workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.530Z] Step group node sleep-example-f96kd-3280373986 deemed failed: child 'sleep-example-f96kd-165151271' failed  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.530Z] node sleep-example-f96kd-3280373986 phase Running -> Failed  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.530Z] node sleep-example-f96kd-3280373986 message: child 'sleep-example-f96kd-165151271' failed  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.530Z] node sleep-example-f96kd-3280373986 finished: 2024-08-30 07:07:11.530083416 +0000 UTC  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.530Z] step group sleep-example-f96kd-3280373986 was unsuccessful: child 'sleep-example-f96kd-165151271' failed  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.530Z] Outbound nodes of sleep-example-f96kd-165151271 is [sleep-example-f96kd-165151271]  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.530Z] Outbound nodes of sleep-example-f96kd is [sleep-example-f96kd-165151271]  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.530Z] node sleep-example-f96kd phase Running -> Failed  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.530Z] node sleep-example-f96kd message: child 'sleep-example-f96kd-165151271' failed  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.530Z] node sleep-example-f96kd finished: 2024-08-30 07:07:11.530243423 +0000 UTC  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.530Z] TaskSet Reconciliation                        namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.530Z] reconcileAgentPod                             namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.530Z] Updated phase Running -> Failed               namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.530Z] Updated message  -> child 'sleep-example-f96kd-165151271' failed  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.530Z] Marking workflow completed                    namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.530Z] Workflow to be dehydrated                     Workflow Size=2121
INFO[2024-08-30T15:07:11.537Z] Workflow update successful                    namespace=argo phase=Failed resourceVersion=2977364 workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:43.296Z] Processing workflow                           Phase= ResourceVersion=2977379 namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:43.300Z] resolved artifact repository                  artifactRepositoryRef="argo/#"
INFO[2024-08-30T15:07:43.300Z] Task-result reconciliation                    namespace=argo numObjs=0 workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:43.300Z] Updated phase  -> Running                     namespace=argo workflow=sleep-example-hb5m4
WARN[2024-08-30T15:07:43.301Z] Node was nil, will be initialized as type Skipped  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:43.301Z] was unable to obtain node for , letting display name to be nodeName  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:43.301Z] Steps node sleep-example-hb5m4 initialized Running  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:43.301Z] StepGroup node sleep-example-hb5m4-3239910440 initialized Running  namespace=argo workflow=sleep-example-hb5m4
WARN[2024-08-30T15:07:43.301Z] Node was nil, will be initialized as type Skipped  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:43.301Z] Pod node sleep-example-hb5m4-3793842069 initialized Pending  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:43.312Z] Created pod: sleep-example-hb5m4[0].sleep (sleep-example-hb5m4-sleep-3793842069)  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:43.312Z] Workflow step group node sleep-example-hb5m4-3239910440 not yet completed  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:43.312Z] TaskSet Reconciliation                        namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:43.312Z] reconcileAgentPod                             namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:43.312Z] Workflow to be dehydrated                     Workflow Size=1445
INFO[2024-08-30T15:07:43.318Z] Workflow update successful                    namespace=argo phase=Running resourceVersion=2977383 workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:44.297Z] Processing workflow                           Phase=Running ResourceVersion=2977383 namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:44.298Z] Task-result reconciliation                    namespace=argo numObjs=0 workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:44.298Z] node changed                                  namespace=argo new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=sleep-example-hb5m4-3793842069 old.message= old.phase=Pending old.progress=0/1 workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:44.299Z] Workflow step group node sleep-example-hb5m4-3239910440 not yet completed  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:44.299Z] TaskSet Reconciliation                        namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:44.299Z] reconcileAgentPod                             namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:44.299Z] Workflow to be dehydrated                     Workflow Size=1765
INFO[2024-08-30T15:07:44.307Z] Workflow update successful                    namespace=argo phase=Running resourceVersion=2977391 workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:45.810Z] Processing workflow                           Phase=Running ResourceVersion=2977391 namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:45.811Z] Task-result reconciliation                    namespace=argo numObjs=1 workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:45.811Z] node unchanged                                namespace=argo nodeID=sleep-example-hb5m4-3793842069 workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:45.812Z] Workflow step group node sleep-example-hb5m4-3239910440 not yet completed  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:45.812Z] TaskSet Reconciliation                        namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:45.812Z] reconcileAgentPod                             namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:48.820Z] Processing workflow                           Phase=Running ResourceVersion=2977391 namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:48.822Z] Task-result reconciliation                    namespace=argo numObjs=1 workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:48.822Z] node changed                                  namespace=argo new.message= new.phase=Running new.progress=0/1 nodeID=sleep-example-hb5m4-3793842069 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:48.823Z] Workflow step group node sleep-example-hb5m4-3239910440 not yet completed  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:48.823Z] TaskSet Reconciliation                        namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:48.823Z] reconcileAgentPod                             namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:48.823Z] Workflow to be dehydrated                     Workflow Size=1786
INFO[2024-08-30T15:07:48.829Z] Workflow update successful                    namespace=argo phase=Running resourceVersion=2977403 workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:32.200Z] Processing workflow                           Phase=Running ResourceVersion=2977403 namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:32.200Z] Task-result reconciliation                    namespace=argo numObjs=1 workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:32.201Z] Determined controller should timeout for sleep-example-hb5m4-3793842069  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:32.201Z] node sleep-example-hb5m4-3793842069 phase Running -> Failed  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:32.201Z] node sleep-example-hb5m4-3793842069 message: pod was absent  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:32.201Z] node sleep-example-hb5m4-3793842069 finished: 2024-08-30 07:08:32.20109289 +0000 UTC  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:32.201Z] task-result changed                           namespace=argo nodeID=sleep-example-hb5m4-3793842069 workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:41.359Z] Step group node sleep-example-hb5m4-3239910440 deemed failed: child 'sleep-example-hb5m4-3793842069' failed  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:41.359Z] node sleep-example-hb5m4-3239910440 phase Running -> Failed  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:41.359Z] node sleep-example-hb5m4-3239910440 message: child 'sleep-example-hb5m4-3793842069' failed  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:41.359Z] node sleep-example-hb5m4-3239910440 finished: 2024-08-30 07:08:41.359849788 +0000 UTC  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:41.359Z] step group sleep-example-hb5m4-3239910440 was unsuccessful: child 'sleep-example-hb5m4-3793842069' failed  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:41.359Z] Outbound nodes of sleep-example-hb5m4-3793842069 is [sleep-example-hb5m4-3793842069]  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:41.359Z] Outbound nodes of sleep-example-hb5m4 is [sleep-example-hb5m4-3793842069]  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:41.360Z] node sleep-example-hb5m4 phase Running -> Failed  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:41.360Z] node sleep-example-hb5m4 message: child 'sleep-example-hb5m4-3793842069' failed  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:41.360Z] node sleep-example-hb5m4 finished: 2024-08-30 07:08:41.360090041 +0000 UTC  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:41.360Z] TaskSet Reconciliation                        namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:41.360Z] reconcileAgentPod                             namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:41.360Z] Updated phase Running -> Failed               namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:41.360Z] Updated message  -> child 'sleep-example-hb5m4-3793842069' failed  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:41.360Z] Marking workflow completed                    namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:41.360Z] Workflow to be dehydrated                     Workflow Size=2130
INFO[2024-08-30T15:08:41.367Z] Workflow update successful                    namespace=argo phase=Failed resourceVersion=2977427 workflow=sleep-example-hb5m4
INFO[2024-08-30T15:10:42.828Z] Alloc=6313 TotalAlloc=23796 Sys=32853 NumGC=7 Goroutines=179 
INFO[2024-08-30T15:15:42.828Z] Alloc=6003 TotalAlloc=24181 Sys=32853 NumGC=10 Goroutines=179 
INFO[2024-08-30T15:20:42.828Z] Alloc=5938 TotalAlloc=24352 Sys=32853 NumGC=12 Goroutines=179 
INFO[2024-08-30T15:25:42.828Z] Alloc=5845 TotalAlloc=24587 Sys=32853 NumGC=15 Goroutines=179 
INFO[2024-08-30T15:30:42.828Z] Alloc=5974 TotalAlloc=24930 Sys=32853 NumGC=17 Goroutines=179 
INFO[2024-08-30T15:35:42.828Z] Alloc=5925 TotalAlloc=25195 Sys=32853 NumGC=20 Goroutines=179 
INFO[2024-08-30T15:38:34.827Z] Processing workflow                           Phase= ResourceVersion=2978073 namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:34.832Z] resolved artifact repository                  artifactRepositoryRef="argo/#"
INFO[2024-08-30T15:38:34.832Z] Task-result reconciliation                    namespace=argo numObjs=0 workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:34.832Z] Updated phase  -> Running                     namespace=argo workflow=sleep-example-jr4lc
WARN[2024-08-30T15:38:34.832Z] Node was nil, will be initialized as type Skipped  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:34.832Z] was unable to obtain node for , letting display name to be nodeName  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:34.832Z] Steps node sleep-example-jr4lc initialized Running  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:34.832Z] StepGroup node sleep-example-jr4lc-2587519231 initialized Running  namespace=argo workflow=sleep-example-jr4lc
WARN[2024-08-30T15:38:34.832Z] Node was nil, will be initialized as type Skipped  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:34.832Z] Pod node sleep-example-jr4lc-3885742982 initialized Pending  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:34.844Z] Created pod: sleep-example-jr4lc[0].sleep (sleep-example-jr4lc-sleep-3885742982)  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:34.844Z] Workflow step group node sleep-example-jr4lc-2587519231 not yet completed  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:34.844Z] TaskSet Reconciliation                        namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:34.845Z] reconcileAgentPod                             namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:34.845Z] Workflow to be dehydrated                     Workflow Size=1445
INFO[2024-08-30T15:38:34.851Z] Workflow update successful                    namespace=argo phase=Running resourceVersion=2978077 workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:35.828Z] Processing workflow                           Phase=Running ResourceVersion=2978077 namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:35.829Z] Task-result reconciliation                    namespace=argo numObjs=0 workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:41.760Z] node changed                                  namespace=argo new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=sleep-example-jr4lc-3885742982 old.message= old.phase=Pending old.progress=0/1 workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:41.761Z] Workflow step group node sleep-example-jr4lc-2587519231 not yet completed  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:41.761Z] TaskSet Reconciliation                        namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:41.761Z] reconcileAgentPod                             namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:41.761Z] Workflow to be dehydrated                     Workflow Size=1765
INFO[2024-08-30T15:38:41.768Z] Workflow update successful                    namespace=argo phase=Running resourceVersion=2978102 workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:42.761Z] Processing workflow                           Phase=Running ResourceVersion=2978102 namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:42.762Z] Task-result reconciliation                    namespace=argo numObjs=1 workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:49.905Z] node changed                                  namespace=argo new.message= new.phase=Running new.progress=0/1 nodeID=sleep-example-jr4lc-3885742982 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:49.905Z] Workflow step group node sleep-example-jr4lc-2587519231 not yet completed  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:49.906Z] TaskSet Reconciliation                        namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:49.906Z] reconcileAgentPod                             namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:49.906Z] Workflow to be dehydrated                     Workflow Size=1786
INFO[2024-08-30T15:38:49.912Z] Workflow update successful                    namespace=argo phase=Running resourceVersion=2978105 workflow=sleep-example-jr4lc
INFO[2024-08-30T15:40:42.828Z] Alloc=6215 TotalAlloc=27810 Sys=32853 NumGC=22 Goroutines=179 
INFO[2024-08-30T15:41:00.059Z] Processing workflow                           Phase=Running ResourceVersion=2978105 namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:00.060Z] Task-result reconciliation                    namespace=argo numObjs=1 workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:13.149Z] task-result changed                           namespace=argo nodeID=sleep-example-jr4lc-3885742982 workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:15.759Z] Workflow pod is missing                       namespace=argo nodeName="sleep-example-jr4lc[0].sleep" nodePhase=Running recentlyStarted=false workflow=sleep-example-jr4lc
ERRO[2024-08-30T15:41:15.759Z] Mark error node                               error="pod deleted" namespace=argo nodeName="sleep-example-jr4lc[0].sleep" workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:15.760Z] node sleep-example-jr4lc-3885742982 phase Running -> Error  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:15.760Z] node sleep-example-jr4lc-3885742982 message: pod deleted  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:15.760Z] node sleep-example-jr4lc-3885742982 finished: 2024-08-30 07:41:15.760163079 +0000 UTC  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:15.761Z] Step group node sleep-example-jr4lc-2587519231 deemed failed: child 'sleep-example-jr4lc-3885742982' failed  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:15.761Z] node sleep-example-jr4lc-2587519231 phase Running -> Failed  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:15.761Z] node sleep-example-jr4lc-2587519231 message: child 'sleep-example-jr4lc-3885742982' failed  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:15.761Z] node sleep-example-jr4lc-2587519231 finished: 2024-08-30 07:41:15.761176068 +0000 UTC  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:15.761Z] step group sleep-example-jr4lc-2587519231 was unsuccessful: child 'sleep-example-jr4lc-3885742982' failed  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:15.761Z] Outbound nodes of sleep-example-jr4lc-3885742982 is [sleep-example-jr4lc-3885742982]  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:15.761Z] Outbound nodes of sleep-example-jr4lc is [sleep-example-jr4lc-3885742982]  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:15.761Z] node sleep-example-jr4lc phase Running -> Failed  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:15.761Z] node sleep-example-jr4lc message: child 'sleep-example-jr4lc-3885742982' failed  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:15.761Z] node sleep-example-jr4lc finished: 2024-08-30 07:41:15.761561607 +0000 UTC  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:15.761Z] TaskSet Reconciliation                        namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:15.761Z] reconcileAgentPod                             namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:15.761Z] Workflow to be dehydrated                     Workflow Size=2055
INFO[2024-08-30T15:41:15.769Z] Workflow update successful                    namespace=argo phase=Running resourceVersion=2978157 workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:16.772Z] Processing workflow                           Phase=Running ResourceVersion=2978157 namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:16.773Z] Task-result reconciliation                    namespace=argo numObjs=1 workflow=sleep-example-jr4lc

Logs from in your workflow's wait container

time="2024-08-30T07:38:36.240Z" level=info msg="Starting Workflow Executor" version=latest+3d41fb2.dirty
time="2024-08-30T07:38:36.245Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-08-30T07:38:36.245Z" level=info msg="Executor initialized" deadline="2024-08-30 07:43:34 +0000 UTC" includeScriptOutput=false namespace=argo podName=sleep-example-jr4lc-sleep-3885742982 templateName=sleep version="&Version{Version:latest+3d41fb2.dirty,BuildDate:2024-08-30T06:46:13Z,GitCommit:3d41fb2329ab5ed63db05cf2cbc373c082a6a7aa,GitTag:untagged,GitTreeState:dirty,GoVersion:go1.22.6,Compiler:gc,Platform:linux/amd64,}"
time="2024-08-30T07:38:36.254Z" level=info msg="Starting deadline monitor"
time="2024-08-30T07:40:59.062Z" level=info msg="Deadline monitor stopped"
time="2024-08-30T07:40:59.062Z" level=info msg="stopping progress monitor (context done)" error="context canceled"
time="2024-08-30T07:40:59.326Z" level=warning msg="Non-transient error: context canceled"
time="2024-08-30T07:40:59.326Z" level=info msg="Main container completed" error="context canceled"
time="2024-08-30T07:40:59.326Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-08-30T07:40:59.326Z" level=info msg="No output parameters"
time="2024-08-30T07:40:59.326Z" level=info msg="No output artifacts"
time="2024-08-30T07:40:59.326Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: sleep-example-jr4lc/sleep-example-jr4lc-sleep-3885742982/main.log"
time="2024-08-30T07:40:59.326Z" level=info msg="Creating minio client using static credentials" endpoint="minio:9000"
time="2024-08-30T07:40:59.326Z" level=info msg="Saving file to s3" bucket=my-bucket endpoint="minio:9000" key=sleep-example-jr4lc/sleep-example-jr4lc-sleep-3885742982/main.log path=/tmp/argo/outputs/logs/main.log
time="2024-08-30T07:40:59.335Z" level=info msg="Save artifact" artifactName=main-logs duration=8.712478ms error="<nil>" key=sleep-example-jr4lc/sleep-example-jr4lc-sleep-3885742982/main.log
time="2024-08-30T07:40:59.335Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2024-08-30T07:40:59.335Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2024-08-30T07:40:59.344Z" level=info msg="Alloc=6468 TotalAlloc=13910 Sys=23637 NumGC=5 Goroutines=7"
time="2024-08-30T07:40:59.344Z" level=info msg="FinalizeOutput start..."

The text was updated successfully, but these errors were encountered:

Fixes argoproj#13533 Signed-off-by: oninowang <[email protected]>

isubasinghe · 2024-09-02T06:08:24Z

@jswxstw

If we want to check if pod is absent after a timeout, why use <= here?

This was a careless logic bug from my end, apoligies for that.

With problem1, which would still be present even if this bug didn't exist, that is a fair point.

Fixes argoproj#13533 (argoproj#13537) Signed-off-by: oninowang <[email protected]>

Fixes #13533 (#13537) Signed-off-by: oninowang <[email protected]>

ericblackburn · 2024-09-24T15:49:12Z

I believe this bug is still present.

fix in that other thread

isubasinghe · 2024-09-25T00:03:13Z

@ericblackburn these edge cases are hard to reproduce, could you please give us a workflow along with instructions to reproduce?

ericblackburn · 2024-09-27T18:11:03Z

The issue of why I was still seeing this behavior is documented in #13537 (comment). Problem and solution has been identified.

#13533 (#13798) Signed-off-by: isubasinghe <[email protected]>

zhucan · 2024-12-04T06:28:41Z

problem1: If pod is deleted in 2min(POD_ABSENT_TIMEOUT), its task result will be marked as completed immediately. However, we cannot confirm that the pod did not exit gracefully and its task result may not have been observed by the controller yet.

@jswxstw which pr fix this problem，and how to produce it?

jswxstw · 2024-12-04T06:44:55Z

@jswxstw which pr fix this problem，and how to produce it?

@zhucan Problem 1 was caused by an incorrect comparison operator in the previous PR(see #13533 (comment)), which is fixed by #13537 and #13798 was further improved to handle more special cases.

zhucan · 2024-12-04T06:53:40Z

level=info msg="Task-result reconciliation" namespace=argo-simu-simulation-platform numObjs=0 workflow=simu-ddl-cpu-350767-1707377
@jswxstw I don't know whether its task result may not have been observed by the controller yet or the task result not be created， brecause of the 'numObjs=0', so the "taskResultReconciliation()" step will skip.

jswxstw · 2024-12-04T07:22:39Z

I don't know whether its task result may not have been observed by the controller yet or the task result not be created， brecause of the 'numObjs=0', so the "taskResultReconciliation()" step will skip.

@zhucan This is what cannot be determined. However, InitializeOutput will create task result upon pod starting and we are more concerned with whether FinalizeOutput successfully updates the task result, which is also cannot be accurately determined.

jswxstw added the type/bug label Aug 30, 2024

jswxstw added a commit to jswxstw/argo-workflows that referenced this issue Aug 30, 2024

fix: Mark task result as completed if pod has been deleted for a while.

ddb6ab1

Fixes argoproj#13533 Signed-off-by: oninowang <[email protected]>

jswxstw mentioned this issue Aug 30, 2024

fix: Mark task result as completed if pod has been deleted for a while. Fixes #13533 #13537

Merged

jswxstw added a commit to jswxstw/argo-workflows that referenced this issue Aug 30, 2024

fix: Mark task result as completed if pod has been deleted for a while.

e337359

Fixes argoproj#13533 Signed-off-by: oninowang <[email protected]>

agilgur5 added the area/controller Controller issues, panics label Aug 30, 2024

agilgur5 added this to the v3.5.x patches milestone Aug 31, 2024

agilgur5 added the type/regression Regression from previous behavior (a specific type of bug) label Aug 31, 2024

isubasinghe closed this as completed in #13537 Sep 2, 2024

isubasinghe closed this as completed in 8a67009 Sep 2, 2024

Joibel pushed a commit to pipekit/argo-workflows that referenced this issue Sep 19, 2024

fix: Mark task result as completed if pod has been deleted for a while.

784a000

Fixes argoproj#13533 (argoproj#13537) Signed-off-by: oninowang <[email protected]>

Joibel pushed a commit that referenced this issue Sep 20, 2024

fix: Mark task result as completed if pod has been deleted for a while.

b0ee16d

Fixes #13533 (#13537) Signed-off-by: oninowang <[email protected]>

agilgur5 changed the title ~~Workflow stuck Running after pod runs 2min and then is deleted not gracefully~~ v3.5.10: Workflow stuck Running after pod runs 2min and then is deleted not gracefully Oct 8, 2024

jswxstw mentioned this issue Oct 9, 2024

v3.5.8: workflow shutdown with strategy: Terminate, but stuck in Running #13726

Closed

4 tasks

isubasinghe mentioned this issue Oct 23, 2024

fix: mark taskresult complete when failed or error. Fixes #12993, Fixes #13533 #13798

Merged

isubasinghe added a commit that referenced this issue Oct 28, 2024

fix: mark taskresult complete when failed or error. Fixes #12993, Fixes

eb23eb6

#13533 (#13798) Signed-off-by: isubasinghe <[email protected]>

isubasinghe added a commit that referenced this issue Oct 30, 2024

fix: mark taskresult complete when failed or error. Fixes #12993, Fixes

60b4287

#13533 (#13798) Signed-off-by: isubasinghe <[email protected]>

isubasinghe added a commit that referenced this issue Oct 30, 2024

fix: mark taskresult complete when failed or error. Fixes #12993, Fixes

8fe8de2

#13533 (#13798) Signed-off-by: isubasinghe <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v3.5.10: Workflow stuck `Running` after pod runs 2min and then is deleted not gracefully #13533

v3.5.10: Workflow stuck `Running` after pod runs 2min and then is deleted not gracefully #13533

jswxstw commented Aug 30, 2024 •

edited

Loading

isubasinghe commented Sep 2, 2024

ericblackburn commented Sep 24, 2024 •

edited

Loading

isubasinghe commented Sep 25, 2024

ericblackburn commented Sep 27, 2024

zhucan commented Dec 4, 2024

jswxstw commented Dec 4, 2024

zhucan commented Dec 4, 2024

jswxstw commented Dec 4, 2024

v3.5.10: Workflow stuck Running after pod runs 2min and then is deleted not gracefully #13533

v3.5.10: Workflow stuck Running after pod runs 2min and then is deleted not gracefully #13533

Comments

jswxstw commented Aug 30, 2024 • edited Loading

Pre-requisites

What happened? What did you expect to happen?

Version(s)

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

Logs from the workflow controller

Logs from in your workflow's wait container

isubasinghe commented Sep 2, 2024

ericblackburn commented Sep 24, 2024 • edited Loading

isubasinghe commented Sep 25, 2024

ericblackburn commented Sep 27, 2024

zhucan commented Dec 4, 2024

jswxstw commented Dec 4, 2024

zhucan commented Dec 4, 2024

jswxstw commented Dec 4, 2024

v3.5.10: Workflow stuck `Running` after pod runs 2min and then is deleted not gracefully #13533

v3.5.10: Workflow stuck `Running` after pod runs 2min and then is deleted not gracefully #13533

jswxstw commented Aug 30, 2024 •

edited

Loading

ericblackburn commented Sep 24, 2024 •

edited

Loading