argo wait hangs when finishedAt is not set on workflow #13550

Open
ilia-medvedev-codefresh opened this issue Sep 3, 2024 · 4 comments · May be fixed by #13551
Labels
area/cli, area/controller, solution/outdated, type/bug

Comments

ilia-medvedev-codefresh commented Sep 3, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

When running `argo wait` on a workflow that was terminated or finished successfully but does not have the `finishedAt` status field set, `argo wait <workflow>` hangs without a response. The expected behavior is for the command to return immediately, since the workflow is in a terminal state.

Version(s)

8a67009

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

To my understanding, a workflow can end up with a null finishedAt field for various reasons. I was able to reproduce it when the issue from #13496 manifested.

To reproduce, create the following RBAC first:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: test-workflow
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: test-workflow
rules:
  - apiGroups: [""]
    resources: ["pods", "configmaps", "events"]
    verbs: ["get", "create", "update", "list", "watch", "patch"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "create", "update", "list", "watch", "patch", "delete"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "create", "update", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: test-workflow
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: test-workflow
subjects:
  - kind: ServiceAccount
    name: test-workflow

Then submit the following workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-wf-
spec:
  archiveLogs: false
  entrypoint: main-dag
  templates:
  - name: sleep
    serviceAccountName: test-workflow
    script:
      workingDir: '/tmp'
      image: 'alpine'
      command: [sh]
      source: |
        sleep 120
  - name: main-dag
    dag:
      tasks:
      - name: sleep
        template: sleep

Then terminate the workflow with `argo terminate` (once the sleep pod starts).
Now, when `argo wait` is run on that workflow, it hangs indefinitely.

We can see in the workflow's status field that taskResultsCompletionStatus for the single task of this workflow is set to false and finishedAt is set to null.

This is the complete workflow object with the status:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  annotations:
    workflows.argoproj.io/description: |
      This is a simple hello world example.
    workflows.argoproj.io/pod-name-format: v2
  creationTimestamp: "2024-09-03T06:28:32Z"
  generateName: test-wf-
  generation: 6
  labels:
    workflows.argoproj.io/archive-strategy: "false"
    workflows.argoproj.io/completed: "false"
    workflows.argoproj.io/phase: Failed
  name: test-wf-t4fhk
  namespace: argo-workflows
  resourceVersion: "535170640"
  uid: 6f114300-7b83-453c-bd08-c18e2fbe325c
spec:
  archiveLogs: false
  arguments: {}
  entrypoint: main-dag
  nodeSelector:
    node-type: workflows
  podGC:
    strategy: OnWorkflowCompletion
  serviceAccountName: workflows-default
  shutdown: Terminate
  templates:
  - inputs: {}
    metadata: {}
    name: sleep
    outputs: {}
    script:
      command:
      - sh
      image: alpine
      name: ""
      resources: {}
      source: |
        sleep 120
      workingDir: /tmp
    serviceAccountName: test-workflow
  - dag:
      tasks:
      - arguments: {}
        name: sleep
        template: sleep
    inputs: {}
    metadata: {}
    name: main-dag
    outputs: {}
  tolerations:
  - effect: NoSchedule
    key: codefresh.io
    operator: Equal
    value: workflows
  ttlStrategy:
    secondsAfterCompletion: 86400
    secondsAfterFailure: 86400
    secondsAfterSuccess: 86400
status:
  artifactGCStatus:
    notSpecified: true
  artifactRepositoryRef:
    artifactRepository:
      archiveLogs: true
      s3:
        bucket: runtime-test-1-workflows-artifacts
        endpoint: s3.amazonaws.com
        region: us-east-1
        useSDKCreds: true
    configMap: artifact-repositories
    key: default-v1
    namespace: argo-workflows
  conditions:
  - status: "False"
    type: PodRunning
  finishedAt: null
  message: Stopped with strategy 'Terminate'
  nodes:
    test-wf-t4fhk:
      children:
      - test-wf-t4fhk-1469664994
      displayName: test-wf-t4fhk
      finishedAt: "2024-09-03T06:30:08Z"
      id: test-wf-t4fhk
      name: test-wf-t4fhk
      outboundNodes:
      - test-wf-t4fhk-1469664994
      phase: Failed
      progress: 0/1
      startedAt: "2024-09-03T06:28:32Z"
      templateName: main-dag
      templateScope: local/test-wf-t4fhk
      type: DAG
    test-wf-t4fhk-1469664994:
      boundaryID: test-wf-t4fhk
      displayName: sleep
      finishedAt: "2024-09-03T06:30:08Z"
      hostNodeName: ip-10-146-65-184.ec2.internal
      id: test-wf-t4fhk-1469664994
      message: 'workflow shutdown with strategy:  Terminate'
      name: test-wf-t4fhk.sleep
      phase: Failed
      progress: 0/1
      startedAt: "2024-09-03T06:28:32Z"
      templateName: sleep
      templateScope: local/test-wf-t4fhk
      type: Pod
  phase: Failed
  progress: 0/1
  startedAt: "2024-09-03T06:28:32Z"
  taskResultsCompletionStatus:
    test-wf-t4fhk-sleep-1469664994: false

I realize that the task completion is a separate issue (mentioned above), but there is also faulty logic in wait, which relies only on the finishedAt status; there are edge cases where the workflow has a terminal phase but finishedAt is not set. A rough sketch of the guard rail I have in mind is below.
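
A minimal sketch of that guard rail, using simplified stand-in types rather than the actual CLI code (the real status types live in pkg/apis/workflow/v1alpha1):

package main

import "fmt"

// workflowStatus keeps only the two fields that matter for this discussion;
// it is a stand-in for the real WorkflowStatus type.
type workflowStatus struct {
	Phase      string // e.g. "Running", "Succeeded", "Failed", "Error"
	FinishedAt string // empty stands in for a null finishedAt
}

// isDone treats a terminal phase as finished even when finishedAt was never
// populated, instead of keying only on finishedAt.
func isDone(s workflowStatus) bool {
	if s.FinishedAt != "" {
		return true
	}
	switch s.Phase {
	case "Succeeded", "Failed", "Error":
		return true
	}
	return false
}

func main() {
	// Mirrors the status from this issue: phase Failed, finishedAt null.
	fmt.Println(isDone(workflowStatus{Phase: "Failed"})) // true
}

With a check like this, wait would return for the workflow above (phase Failed, finishedAt null) instead of hanging.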

Logs from the workflow controller

Nothing relevant appears in controller logs

Logs from your workflow's wait container

Not relevant for this issue
agilgur5 changed the title from "Argo wait hangs when finishedAt is not set on workflow" to "argo wait hangs when finishedAt is not set on workflow" on Sep 3, 2024
jswxstw (Member) commented Sep 4, 2024

I reproduced it, but the workflow was stuck in Running in my case, not Failed with an empty finishedAt.

The root cause is as follows:

woc.controller.queuePodForCleanup(pod.Namespace, pod.Name, terminateContainers)
msg := fmt.Sprintf("workflow shutdown with strategy: %s", woc.GetShutdownStrategy())
woc.handleExecutionControlError(nodeID, wfNodesLock, msg)

// It is important that we *never* label pods as completed until we successfully updated the workflow
// Failing to do so means we can have inconsistent state.
// Pods may be labeled multiple times.
woc.queuePodsForCleanup()

The node is marked as Failed before the pod is terminated, causing LabelKeyCompleted to be set in advance; after that, the pod's labels (including LabelKeyReportOutputsCompleted) will no longer be observed by the controller.

jswxstw (Member) commented Sep 4, 2024

> workflow Failed with empty finishedAt

@ilia-medvedev-codefresh How could this happen?

ilia-medvedev-codefresh (Author) commented:
Yeah @jswxstw, it seems you are right. I started investigating this issue on one of my clusters, which was running 3.5.4, and the problem definitely existed there. I then switched to a local env for testing my changes, but at some point probably got mixed up with all the different versions; I now see that I was running 3.5.4 for the controller when I reproduced the RBAC issue. Nonetheless, I still believe it is worth adding this guard rail to wait, since there have already been numerous regressions that caused finishedAt not to be set.

jswxstw (Member) commented Sep 4, 2024

In my opinion, the fundamental problem is that the workflow is stuck in Running. I can't think of any scenario where the workflow is Failed but finishedAt is not set.

woc.wf.Status.FinishedAt = metav1.Time{Time: time.Now().UTC()}

I don't see any special logic that would cause the two to be inconsistent.
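
For illustration, a simplified sketch of the invariant being described here, with hypothetical stand-in types rather than the actual controller code: a terminal phase and finishedAt are written in the same status update, so the two should not diverge.

package main

import (
	"fmt"
	"time"
)

// status is a trimmed-down stand-in for the real WorkflowStatus type.
type status struct {
	Phase      string
	FinishedAt time.Time
}

// markPhase stamps finishedAt in the same update that sets a terminal phase,
// which is why Phase == Failed with a null finishedAt is not expected.
func markPhase(s *status, phase string) {
	s.Phase = phase
	switch phase {
	case "Succeeded", "Failed", "Error":
		if s.FinishedAt.IsZero() {
			s.FinishedAt = time.Now().UTC()
		}
	}
}

func main() {
	var s status
	markPhase(&s, "Failed")
	fmt.Printf("phase=%s finishedAt set=%v\n", s.Phase, !s.FinishedAt.IsZero())
}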

agilgur5 added the area/controller, area/cli, and solution/outdated labels on Sep 4, 2024