Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v3.5.10: Workflow stuck Running after pod runs 2min and then is deleted not gracefully #13533

Closed
4 tasks done
jswxstw opened this issue Aug 30, 2024 · 8 comments · Fixed by #13537 or #13798
Closed
4 tasks done
Labels
area/controller Controller issues, panics type/bug type/regression Regression from previous behavior (a specific type of bug)

Comments

@jswxstw
Copy link
Member

jswxstw commented Aug 30, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

#13454 mark the node as failed after a timeout and mark the workflowtaskresult as completed only when the pod is absent and the node has not been completed in taskResultReconciliation.
If we want to check if pod is absent after a timeout, why use <= here?

return time.Since(node.StartedAt.Time) <= envutil.LookupEnvDurationOr("POD_ABSENT_TIMEOUT", 2*time.Minute)

I am very puzzled by this PR, see #13373 (comment).

I think this PR has two problems:

  • problem1: If pod is deleted in 2min(POD_ABSENT_TIMEOUT), its task result will be marked as completed immediately. However, we cannot confirm that the pod did not exit gracefully and its task result may not have been observed by the controller yet.
  • problem2: If pod is deleted after 2min(POD_ABSENT_TIMEOUT), node will be marked as Error with message pod deleted and its task result will always be incomplete if pod did not exit gracefully.

I reproduced the problem2 as below:

  1. rewrite the executor FinalizeOutput logic to facilitate the simulation of abnormal exit midway
// FinalizeOutput adds a label or annotation to denote that outputs have completed reporting.
func (we *WorkflowExecutor) FinalizeOutput(ctx context.Context) {
	log.Info("FinalizeOutput start...")
	// sleep 5s here, avoid patching task result when exit not gracefully.
        time.Sleep(5 * time.Second)
	err := retryutil.OnError(wait.Backoff{
		Duration: time.Second,
		Factor:   2,
		Jitter:   0.1,
		Steps:    5,
		Cap:      30 * time.Second,
	}, errorsutil.IsTransientErr, func() error {
		err := we.patchTaskResultLabels(ctx, map[string]string{
			common.LabelKeyReportOutputsCompleted: "true",
		})
		if apierr.IsForbidden(err) || apierr.IsNotFound(err) {
			log.WithError(err).Warn("failed to patch task result, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/latest/workflow-rbac/")
			// Only added as a backup in case LabelKeyReportOutputsCompleted could not be set
			err = we.AddAnnotation(ctx, common.AnnotationKeyReportOutputsCompleted, "true")
		}
		return err
	})
	if err != nil {
		we.AddError(err)
	}
	log.Info("FinalizeOutput end...")
}
  1. submit the workflow demo, and delete the pod after it runs 2min.
# kubectl delete pod sleep-example-jr4lc-sleep-3885742982 --force        
# argo get sleep-example-jr4lc
Name:                sleep-example-jr4lc
Namespace:           argo
ServiceAccount:      unset (will run with the default ServiceAccount)
Status:              Running
Conditions:          
 PodRunning          False
Created:             Fri Aug 30 15:38:34 +0800 (29 minutes ago)
Started:             Fri Aug 30 15:38:34 +0800 (29 minutes ago)
Duration:            29 minutes 43 seconds
Progress:            0/1

STEP                    TEMPLATE  PODNAME                               DURATION  MESSAGE
 ✖ sleep-example-jr4lc  main                                                      child 'sleep-example-jr4lc-3885742982' failed  
 └───⚠ sleep            sleep     sleep-example-jr4lc-sleep-3885742982  2m        pod deleted 

workflow spec:

nodes:
    sleep-example-jr4lc:
      children:
      - sleep-example-jr4lc-2587519231
      displayName: sleep-example-jr4lc
      finishedAt: "2024-08-30T07:41:15Z"
      id: sleep-example-jr4lc
      message: child 'sleep-example-jr4lc-3885742982' failed
      name: sleep-example-jr4lc
      outboundNodes:
      - sleep-example-jr4lc-3885742982
      phase: Failed
      progress: 0/1
      startedAt: "2024-08-30T07:38:34Z"
      templateName: main
      templateScope: local/sleep-example-jr4lc
      type: Steps
    sleep-example-jr4lc-2587519231:
      boundaryID: sleep-example-jr4lc
      children:
      - sleep-example-jr4lc-3885742982
      displayName: '[0]'
      finishedAt: "2024-08-30T07:41:15Z"
      id: sleep-example-jr4lc-2587519231
      message: child 'sleep-example-jr4lc-3885742982' failed
      name: sleep-example-jr4lc[0]
      nodeFlag: {}
      phase: Failed
      progress: 0/1
      startedAt: "2024-08-30T07:38:34Z"
      templateScope: local/sleep-example-jr4lc
      type: StepGroup
    sleep-example-jr4lc-3885742982:
      boundaryID: sleep-example-jr4lc
      displayName: sleep
      finishedAt: "2024-08-30T07:41:15Z"
      hostNodeName: k3d-k3s-default-server-0
      id: sleep-example-jr4lc-3885742982
      message: pod deleted
      name: sleep-example-jr4lc[0].sleep
      outputs:
        artifacts:
        - name: main-logs
          s3:
            key: sleep-example-jr4lc/sleep-example-jr4lc-sleep-3885742982/main.log
      phase: Error
      progress: 0/1
      startedAt: "2024-08-30T07:38:34Z"
      templateName: sleep
      templateScope: local/sleep-example-jr4lc
      type: Pod
  phase: Running
  progress: 0/1
  startedAt: "2024-08-30T07:38:34Z"
  taskResultsCompletionStatus:
    sleep-example-jr4lc-3885742982: false

Version(s)

latest

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: sleep-example-
spec:
  entrypoint: main
  templates:
  - name: main
    steps:
    - - name: sleep
        template: sleep

  - name: sleep
    podSpecPatch: |
      terminationGracePeriodSeconds: 0
    container:
      image: alpine:latest
      command: [sh, -c, sleep 3600]

Logs from the workflow controller

INFO[2024-08-30T15:06:07.597Z] Processing workflow                           Phase= ResourceVersion=2977313 namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:07.602Z] resolved artifact repository                  artifactRepositoryRef="argo/#"
INFO[2024-08-30T15:06:07.602Z] Task-result reconciliation                    namespace=argo numObjs=0 workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:21.466Z] Updated phase  -> Running                     namespace=argo workflow=sleep-example-f96kd
WARN[2024-08-30T15:06:21.466Z] Node was nil, will be initialized as type Skipped  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:21.466Z] was unable to obtain node for , letting display name to be nodeName  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:21.466Z] Steps node sleep-example-f96kd initialized Running  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:21.466Z] StepGroup node sleep-example-f96kd-3280373986 initialized Running  namespace=argo workflow=sleep-example-f96kd
WARN[2024-08-30T15:06:21.467Z] Node was nil, will be initialized as type Skipped  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:21.467Z] Pod node sleep-example-f96kd-165151271 initialized Pending  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:21.481Z] Created pod: sleep-example-f96kd[0].sleep (sleep-example-f96kd-sleep-165151271)  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:21.481Z] Workflow step group node sleep-example-f96kd-3280373986 not yet completed  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:21.482Z] TaskSet Reconciliation                        namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:21.482Z] reconcileAgentPod                             namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:21.482Z] Workflow to be dehydrated                     Workflow Size=1442
INFO[2024-08-30T15:06:21.491Z] Workflow update successful                    namespace=argo phase=Running resourceVersion=2977322 workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:21.503Z] Processing workflow                           Phase=Running ResourceVersion=2977322 namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:21.504Z] Task-result reconciliation                    namespace=argo numObjs=0 workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:21.504Z] node changed                                  namespace=argo new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=sleep-example-f96kd-165151271 old.message= old.phase=Pending old.progress=0/1 workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:21.505Z] Workflow step group node sleep-example-f96kd-3280373986 not yet completed  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:21.505Z] TaskSet Reconciliation                        namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:21.505Z] reconcileAgentPod                             namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:21.505Z] Workflow to be dehydrated                     Workflow Size=1762
INFO[2024-08-30T15:06:21.512Z] Workflow update successful                    namespace=argo phase=Running resourceVersion=2977327 workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:22.484Z] Processing workflow                           Phase=Running ResourceVersion=2977327 namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:22.485Z] Task-result reconciliation                    namespace=argo numObjs=0 workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:22.485Z] node unchanged                                namespace=argo nodeID=sleep-example-f96kd-165151271 workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:22.486Z] Workflow step group node sleep-example-f96kd-3280373986 not yet completed  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:22.486Z] TaskSet Reconciliation                        namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:22.486Z] reconcileAgentPod                             namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:23.653Z] Processing workflow                           Phase=Running ResourceVersion=2977327 namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:23.654Z] Task-result reconciliation                    namespace=argo numObjs=1 workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:23.654Z] node unchanged                                namespace=argo nodeID=sleep-example-f96kd-165151271 workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:23.654Z] Workflow step group node sleep-example-f96kd-3280373986 not yet completed  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:23.654Z] TaskSet Reconciliation                        namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:23.654Z] reconcileAgentPod                             namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:28.665Z] Processing workflow                           Phase=Running ResourceVersion=2977327 namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:28.666Z] Task-result reconciliation                    namespace=argo numObjs=1 workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:28.666Z] node changed                                  namespace=argo new.message= new.phase=Running new.progress=0/1 nodeID=sleep-example-f96kd-165151271 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:28.667Z] Workflow step group node sleep-example-f96kd-3280373986 not yet completed  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:28.667Z] TaskSet Reconciliation                        namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:28.667Z] reconcileAgentPod                             namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:06:28.667Z] Workflow to be dehydrated                     Workflow Size=1782
INFO[2024-08-30T15:06:28.673Z] Workflow update successful                    namespace=argo phase=Running resourceVersion=2977343 workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.528Z] Processing workflow                           Phase=Running ResourceVersion=2977343 namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.529Z] Task-result reconciliation                    namespace=argo numObjs=1 workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.529Z] Determined controller should timeout for sleep-example-f96kd-165151271  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.529Z] node sleep-example-f96kd-165151271 phase Running -> Failed  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.529Z] node sleep-example-f96kd-165151271 message: pod was absent  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.529Z] node sleep-example-f96kd-165151271 finished: 2024-08-30 07:07:11.529418633 +0000 UTC  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.529Z] task-result changed                           namespace=argo nodeID=sleep-example-f96kd-165151271 workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.530Z] Step group node sleep-example-f96kd-3280373986 deemed failed: child 'sleep-example-f96kd-165151271' failed  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.530Z] node sleep-example-f96kd-3280373986 phase Running -> Failed  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.530Z] node sleep-example-f96kd-3280373986 message: child 'sleep-example-f96kd-165151271' failed  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.530Z] node sleep-example-f96kd-3280373986 finished: 2024-08-30 07:07:11.530083416 +0000 UTC  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.530Z] step group sleep-example-f96kd-3280373986 was unsuccessful: child 'sleep-example-f96kd-165151271' failed  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.530Z] Outbound nodes of sleep-example-f96kd-165151271 is [sleep-example-f96kd-165151271]  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.530Z] Outbound nodes of sleep-example-f96kd is [sleep-example-f96kd-165151271]  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.530Z] node sleep-example-f96kd phase Running -> Failed  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.530Z] node sleep-example-f96kd message: child 'sleep-example-f96kd-165151271' failed  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.530Z] node sleep-example-f96kd finished: 2024-08-30 07:07:11.530243423 +0000 UTC  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.530Z] TaskSet Reconciliation                        namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.530Z] reconcileAgentPod                             namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.530Z] Updated phase Running -> Failed               namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.530Z] Updated message  -> child 'sleep-example-f96kd-165151271' failed  namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.530Z] Marking workflow completed                    namespace=argo workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:11.530Z] Workflow to be dehydrated                     Workflow Size=2121
INFO[2024-08-30T15:07:11.537Z] Workflow update successful                    namespace=argo phase=Failed resourceVersion=2977364 workflow=sleep-example-f96kd
INFO[2024-08-30T15:07:43.296Z] Processing workflow                           Phase= ResourceVersion=2977379 namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:43.300Z] resolved artifact repository                  artifactRepositoryRef="argo/#"
INFO[2024-08-30T15:07:43.300Z] Task-result reconciliation                    namespace=argo numObjs=0 workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:43.300Z] Updated phase  -> Running                     namespace=argo workflow=sleep-example-hb5m4
WARN[2024-08-30T15:07:43.301Z] Node was nil, will be initialized as type Skipped  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:43.301Z] was unable to obtain node for , letting display name to be nodeName  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:43.301Z] Steps node sleep-example-hb5m4 initialized Running  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:43.301Z] StepGroup node sleep-example-hb5m4-3239910440 initialized Running  namespace=argo workflow=sleep-example-hb5m4
WARN[2024-08-30T15:07:43.301Z] Node was nil, will be initialized as type Skipped  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:43.301Z] Pod node sleep-example-hb5m4-3793842069 initialized Pending  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:43.312Z] Created pod: sleep-example-hb5m4[0].sleep (sleep-example-hb5m4-sleep-3793842069)  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:43.312Z] Workflow step group node sleep-example-hb5m4-3239910440 not yet completed  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:43.312Z] TaskSet Reconciliation                        namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:43.312Z] reconcileAgentPod                             namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:43.312Z] Workflow to be dehydrated                     Workflow Size=1445
INFO[2024-08-30T15:07:43.318Z] Workflow update successful                    namespace=argo phase=Running resourceVersion=2977383 workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:44.297Z] Processing workflow                           Phase=Running ResourceVersion=2977383 namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:44.298Z] Task-result reconciliation                    namespace=argo numObjs=0 workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:44.298Z] node changed                                  namespace=argo new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=sleep-example-hb5m4-3793842069 old.message= old.phase=Pending old.progress=0/1 workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:44.299Z] Workflow step group node sleep-example-hb5m4-3239910440 not yet completed  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:44.299Z] TaskSet Reconciliation                        namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:44.299Z] reconcileAgentPod                             namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:44.299Z] Workflow to be dehydrated                     Workflow Size=1765
INFO[2024-08-30T15:07:44.307Z] Workflow update successful                    namespace=argo phase=Running resourceVersion=2977391 workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:45.810Z] Processing workflow                           Phase=Running ResourceVersion=2977391 namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:45.811Z] Task-result reconciliation                    namespace=argo numObjs=1 workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:45.811Z] node unchanged                                namespace=argo nodeID=sleep-example-hb5m4-3793842069 workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:45.812Z] Workflow step group node sleep-example-hb5m4-3239910440 not yet completed  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:45.812Z] TaskSet Reconciliation                        namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:45.812Z] reconcileAgentPod                             namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:48.820Z] Processing workflow                           Phase=Running ResourceVersion=2977391 namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:48.822Z] Task-result reconciliation                    namespace=argo numObjs=1 workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:48.822Z] node changed                                  namespace=argo new.message= new.phase=Running new.progress=0/1 nodeID=sleep-example-hb5m4-3793842069 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:48.823Z] Workflow step group node sleep-example-hb5m4-3239910440 not yet completed  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:48.823Z] TaskSet Reconciliation                        namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:48.823Z] reconcileAgentPod                             namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:07:48.823Z] Workflow to be dehydrated                     Workflow Size=1786
INFO[2024-08-30T15:07:48.829Z] Workflow update successful                    namespace=argo phase=Running resourceVersion=2977403 workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:32.200Z] Processing workflow                           Phase=Running ResourceVersion=2977403 namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:32.200Z] Task-result reconciliation                    namespace=argo numObjs=1 workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:32.201Z] Determined controller should timeout for sleep-example-hb5m4-3793842069  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:32.201Z] node sleep-example-hb5m4-3793842069 phase Running -> Failed  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:32.201Z] node sleep-example-hb5m4-3793842069 message: pod was absent  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:32.201Z] node sleep-example-hb5m4-3793842069 finished: 2024-08-30 07:08:32.20109289 +0000 UTC  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:32.201Z] task-result changed                           namespace=argo nodeID=sleep-example-hb5m4-3793842069 workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:41.359Z] Step group node sleep-example-hb5m4-3239910440 deemed failed: child 'sleep-example-hb5m4-3793842069' failed  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:41.359Z] node sleep-example-hb5m4-3239910440 phase Running -> Failed  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:41.359Z] node sleep-example-hb5m4-3239910440 message: child 'sleep-example-hb5m4-3793842069' failed  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:41.359Z] node sleep-example-hb5m4-3239910440 finished: 2024-08-30 07:08:41.359849788 +0000 UTC  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:41.359Z] step group sleep-example-hb5m4-3239910440 was unsuccessful: child 'sleep-example-hb5m4-3793842069' failed  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:41.359Z] Outbound nodes of sleep-example-hb5m4-3793842069 is [sleep-example-hb5m4-3793842069]  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:41.359Z] Outbound nodes of sleep-example-hb5m4 is [sleep-example-hb5m4-3793842069]  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:41.360Z] node sleep-example-hb5m4 phase Running -> Failed  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:41.360Z] node sleep-example-hb5m4 message: child 'sleep-example-hb5m4-3793842069' failed  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:41.360Z] node sleep-example-hb5m4 finished: 2024-08-30 07:08:41.360090041 +0000 UTC  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:41.360Z] TaskSet Reconciliation                        namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:41.360Z] reconcileAgentPod                             namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:41.360Z] Updated phase Running -> Failed               namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:41.360Z] Updated message  -> child 'sleep-example-hb5m4-3793842069' failed  namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:41.360Z] Marking workflow completed                    namespace=argo workflow=sleep-example-hb5m4
INFO[2024-08-30T15:08:41.360Z] Workflow to be dehydrated                     Workflow Size=2130
INFO[2024-08-30T15:08:41.367Z] Workflow update successful                    namespace=argo phase=Failed resourceVersion=2977427 workflow=sleep-example-hb5m4
INFO[2024-08-30T15:10:42.828Z] Alloc=6313 TotalAlloc=23796 Sys=32853 NumGC=7 Goroutines=179 
INFO[2024-08-30T15:15:42.828Z] Alloc=6003 TotalAlloc=24181 Sys=32853 NumGC=10 Goroutines=179 
INFO[2024-08-30T15:20:42.828Z] Alloc=5938 TotalAlloc=24352 Sys=32853 NumGC=12 Goroutines=179 
INFO[2024-08-30T15:25:42.828Z] Alloc=5845 TotalAlloc=24587 Sys=32853 NumGC=15 Goroutines=179 
INFO[2024-08-30T15:30:42.828Z] Alloc=5974 TotalAlloc=24930 Sys=32853 NumGC=17 Goroutines=179 
INFO[2024-08-30T15:35:42.828Z] Alloc=5925 TotalAlloc=25195 Sys=32853 NumGC=20 Goroutines=179 
INFO[2024-08-30T15:38:34.827Z] Processing workflow                           Phase= ResourceVersion=2978073 namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:34.832Z] resolved artifact repository                  artifactRepositoryRef="argo/#"
INFO[2024-08-30T15:38:34.832Z] Task-result reconciliation                    namespace=argo numObjs=0 workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:34.832Z] Updated phase  -> Running                     namespace=argo workflow=sleep-example-jr4lc
WARN[2024-08-30T15:38:34.832Z] Node was nil, will be initialized as type Skipped  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:34.832Z] was unable to obtain node for , letting display name to be nodeName  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:34.832Z] Steps node sleep-example-jr4lc initialized Running  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:34.832Z] StepGroup node sleep-example-jr4lc-2587519231 initialized Running  namespace=argo workflow=sleep-example-jr4lc
WARN[2024-08-30T15:38:34.832Z] Node was nil, will be initialized as type Skipped  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:34.832Z] Pod node sleep-example-jr4lc-3885742982 initialized Pending  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:34.844Z] Created pod: sleep-example-jr4lc[0].sleep (sleep-example-jr4lc-sleep-3885742982)  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:34.844Z] Workflow step group node sleep-example-jr4lc-2587519231 not yet completed  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:34.844Z] TaskSet Reconciliation                        namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:34.845Z] reconcileAgentPod                             namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:34.845Z] Workflow to be dehydrated                     Workflow Size=1445
INFO[2024-08-30T15:38:34.851Z] Workflow update successful                    namespace=argo phase=Running resourceVersion=2978077 workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:35.828Z] Processing workflow                           Phase=Running ResourceVersion=2978077 namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:35.829Z] Task-result reconciliation                    namespace=argo numObjs=0 workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:41.760Z] node changed                                  namespace=argo new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=sleep-example-jr4lc-3885742982 old.message= old.phase=Pending old.progress=0/1 workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:41.761Z] Workflow step group node sleep-example-jr4lc-2587519231 not yet completed  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:41.761Z] TaskSet Reconciliation                        namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:41.761Z] reconcileAgentPod                             namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:41.761Z] Workflow to be dehydrated                     Workflow Size=1765
INFO[2024-08-30T15:38:41.768Z] Workflow update successful                    namespace=argo phase=Running resourceVersion=2978102 workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:42.761Z] Processing workflow                           Phase=Running ResourceVersion=2978102 namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:42.762Z] Task-result reconciliation                    namespace=argo numObjs=1 workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:49.905Z] node changed                                  namespace=argo new.message= new.phase=Running new.progress=0/1 nodeID=sleep-example-jr4lc-3885742982 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:49.905Z] Workflow step group node sleep-example-jr4lc-2587519231 not yet completed  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:49.906Z] TaskSet Reconciliation                        namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:49.906Z] reconcileAgentPod                             namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:38:49.906Z] Workflow to be dehydrated                     Workflow Size=1786
INFO[2024-08-30T15:38:49.912Z] Workflow update successful                    namespace=argo phase=Running resourceVersion=2978105 workflow=sleep-example-jr4lc
INFO[2024-08-30T15:40:42.828Z] Alloc=6215 TotalAlloc=27810 Sys=32853 NumGC=22 Goroutines=179 
INFO[2024-08-30T15:41:00.059Z] Processing workflow                           Phase=Running ResourceVersion=2978105 namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:00.060Z] Task-result reconciliation                    namespace=argo numObjs=1 workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:13.149Z] task-result changed                           namespace=argo nodeID=sleep-example-jr4lc-3885742982 workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:15.759Z] Workflow pod is missing                       namespace=argo nodeName="sleep-example-jr4lc[0].sleep" nodePhase=Running recentlyStarted=false workflow=sleep-example-jr4lc
ERRO[2024-08-30T15:41:15.759Z] Mark error node                               error="pod deleted" namespace=argo nodeName="sleep-example-jr4lc[0].sleep" workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:15.760Z] node sleep-example-jr4lc-3885742982 phase Running -> Error  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:15.760Z] node sleep-example-jr4lc-3885742982 message: pod deleted  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:15.760Z] node sleep-example-jr4lc-3885742982 finished: 2024-08-30 07:41:15.760163079 +0000 UTC  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:15.761Z] Step group node sleep-example-jr4lc-2587519231 deemed failed: child 'sleep-example-jr4lc-3885742982' failed  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:15.761Z] node sleep-example-jr4lc-2587519231 phase Running -> Failed  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:15.761Z] node sleep-example-jr4lc-2587519231 message: child 'sleep-example-jr4lc-3885742982' failed  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:15.761Z] node sleep-example-jr4lc-2587519231 finished: 2024-08-30 07:41:15.761176068 +0000 UTC  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:15.761Z] step group sleep-example-jr4lc-2587519231 was unsuccessful: child 'sleep-example-jr4lc-3885742982' failed  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:15.761Z] Outbound nodes of sleep-example-jr4lc-3885742982 is [sleep-example-jr4lc-3885742982]  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:15.761Z] Outbound nodes of sleep-example-jr4lc is [sleep-example-jr4lc-3885742982]  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:15.761Z] node sleep-example-jr4lc phase Running -> Failed  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:15.761Z] node sleep-example-jr4lc message: child 'sleep-example-jr4lc-3885742982' failed  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:15.761Z] node sleep-example-jr4lc finished: 2024-08-30 07:41:15.761561607 +0000 UTC  namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:15.761Z] TaskSet Reconciliation                        namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:15.761Z] reconcileAgentPod                             namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:15.761Z] Workflow to be dehydrated                     Workflow Size=2055
INFO[2024-08-30T15:41:15.769Z] Workflow update successful                    namespace=argo phase=Running resourceVersion=2978157 workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:16.772Z] Processing workflow                           Phase=Running ResourceVersion=2978157 namespace=argo workflow=sleep-example-jr4lc
INFO[2024-08-30T15:41:16.773Z] Task-result reconciliation                    namespace=argo numObjs=1 workflow=sleep-example-jr4lc

Logs from in your workflow's wait container

time="2024-08-30T07:38:36.240Z" level=info msg="Starting Workflow Executor" version=latest+3d41fb2.dirty
time="2024-08-30T07:38:36.245Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-08-30T07:38:36.245Z" level=info msg="Executor initialized" deadline="2024-08-30 07:43:34 +0000 UTC" includeScriptOutput=false namespace=argo podName=sleep-example-jr4lc-sleep-3885742982 templateName=sleep version="&Version{Version:latest+3d41fb2.dirty,BuildDate:2024-08-30T06:46:13Z,GitCommit:3d41fb2329ab5ed63db05cf2cbc373c082a6a7aa,GitTag:untagged,GitTreeState:dirty,GoVersion:go1.22.6,Compiler:gc,Platform:linux/amd64,}"
time="2024-08-30T07:38:36.254Z" level=info msg="Starting deadline monitor"
time="2024-08-30T07:40:59.062Z" level=info msg="Deadline monitor stopped"
time="2024-08-30T07:40:59.062Z" level=info msg="stopping progress monitor (context done)" error="context canceled"
time="2024-08-30T07:40:59.326Z" level=warning msg="Non-transient error: context canceled"
time="2024-08-30T07:40:59.326Z" level=info msg="Main container completed" error="context canceled"
time="2024-08-30T07:40:59.326Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-08-30T07:40:59.326Z" level=info msg="No output parameters"
time="2024-08-30T07:40:59.326Z" level=info msg="No output artifacts"
time="2024-08-30T07:40:59.326Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: sleep-example-jr4lc/sleep-example-jr4lc-sleep-3885742982/main.log"
time="2024-08-30T07:40:59.326Z" level=info msg="Creating minio client using static credentials" endpoint="minio:9000"
time="2024-08-30T07:40:59.326Z" level=info msg="Saving file to s3" bucket=my-bucket endpoint="minio:9000" key=sleep-example-jr4lc/sleep-example-jr4lc-sleep-3885742982/main.log path=/tmp/argo/outputs/logs/main.log
time="2024-08-30T07:40:59.335Z" level=info msg="Save artifact" artifactName=main-logs duration=8.712478ms error="<nil>" key=sleep-example-jr4lc/sleep-example-jr4lc-sleep-3885742982/main.log
time="2024-08-30T07:40:59.335Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2024-08-30T07:40:59.335Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2024-08-30T07:40:59.344Z" level=info msg="Alloc=6468 TotalAlloc=13910 Sys=23637 NumGC=5 Goroutines=7"
time="2024-08-30T07:40:59.344Z" level=info msg="FinalizeOutput start..."
jswxstw added a commit to jswxstw/argo-workflows that referenced this issue Aug 30, 2024
jswxstw added a commit to jswxstw/argo-workflows that referenced this issue Aug 30, 2024
@agilgur5 agilgur5 added the area/controller Controller issues, panics label Aug 30, 2024
@agilgur5 agilgur5 added this to the v3.5.x patches milestone Aug 31, 2024
@agilgur5 agilgur5 added the type/regression Regression from previous behavior (a specific type of bug) label Aug 31, 2024
@isubasinghe
Copy link
Member

@jswxstw

If we want to check if pod is absent after a timeout, why use <= here?

This was a careless logic bug from my end, apoligies for that.

With problem1, which would still be present even if this bug didn't exist, that is a fair point.

Joibel pushed a commit to pipekit/argo-workflows that referenced this issue Sep 19, 2024
@ericblackburn
Copy link

ericblackburn commented Sep 24, 2024

I believe this bug is still present.

  • fix in that other thread

@isubasinghe
Copy link
Member

@ericblackburn these edge cases are hard to reproduce, could you please give us a workflow along with instructions to reproduce?

@ericblackburn
Copy link

The issue of why I was still seeing this behavior is documented in #13537 (comment). Problem and solution has been identified.

@agilgur5 agilgur5 changed the title Workflow stuck Running after pod runs 2min and then is deleted not gracefully v3.5.10: Workflow stuck Running after pod runs 2min and then is deleted not gracefully Oct 8, 2024
@zhucan
Copy link

zhucan commented Dec 4, 2024

problem1: If pod is deleted in 2min(POD_ABSENT_TIMEOUT), its task result will be marked as completed immediately. However, we cannot confirm that the pod did not exit gracefully and its task result may not have been observed by the controller yet.

@jswxstw which pr fix this problem,and how to produce it?

@jswxstw
Copy link
Member Author

jswxstw commented Dec 4, 2024

@jswxstw which pr fix this problem,and how to produce it?

@zhucan Problem 1 was caused by an incorrect comparison operator in the previous PR(see #13533 (comment)), which is fixed by #13537 and #13798 was further improved to handle more special cases.

@zhucan
Copy link

zhucan commented Dec 4, 2024

level=info msg="Task-result reconciliation" namespace=argo-simu-simulation-platform numObjs=0 workflow=simu-ddl-cpu-350767-1707377
@jswxstw I don't know whether its task result may not have been observed by the controller yet or the task result not be created, brecause of the 'numObjs=0', so the "taskResultReconciliation()" step will skip.

@jswxstw
Copy link
Member Author

jswxstw commented Dec 4, 2024

I don't know whether its task result may not have been observed by the controller yet or the task result not be created, brecause of the 'numObjs=0', so the "taskResultReconciliation()" step will skip.

@zhucan This is what cannot be determined. However, InitializeOutput will create task result upon pod starting and we are more concerned with whether FinalizeOutput successfully updates the task result, which is also cannot be accurately determined.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/controller Controller issues, panics type/bug type/regression Regression from previous behavior (a specific type of bug)
Projects
None yet
5 participants