Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

signaled container: wait error: unable to upgrade connection: container not found. node Succeeded but wf not progressing #13627

Open
2 of 4 tasks
heyleke opened this issue Sep 19, 2024 · 12 comments
Labels

Comments

@heyleke
Copy link
Contributor

heyleke commented Sep 19, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

main container of a pod finished, wait container uploaded artifact, workflow controller proceeded to 'Succeeded', but workflow didn't continue. Probably due to wait container not reachable.

Problem is hard to reproduce, but we keep logs of all workflows in loki/promtail, so those are available.

Reproduction not done with 'latests' as that is not possible in production environment atm.

Version(s)

v3.5.10

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  annotations:
    workflows.argoproj.io/pod-name-format: v1
  generation: 3
  labels:
    app: foobar
  name: foobar
  namespace: zoo
spec:
  entrypoint: zoo
  podGC:
    strategy: OnPodCompletion
  templates:
  - name: zoo
    steps:
    - - arguments:
          parameters:
          - name: excluded_node
            value: ""
          - name: log_postfix
            value: ""
        name: getBatchlist
        template: runzoo
        when: false == true
    - - arguments:
          parameters:
          - name: excluded_node
            value: ""
          - name: log_postfix
            value: ""
        continueOn:
          failed: true
        name: runTestbatch
        template: runzoo
        when: false == false
    - - arguments:
          parameters:
          - name: excluded_node
            value: '{{steps.runTestbatch.hostNodeName}}'
          - name: log_postfix
            value: -rerun-1
        continueOn:
          failed: true
        name: rerunTestbatch
        template: runzoo
        when: ( false == false && {{steps.runTestbatch.status}} != Succeeded )
  - affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kubernetes.io/hostname
              operator: NotIn
              values:
              - '{{inputs.parameters.excluded_node}}'
    container:
      args:
      - -c
      - sleep 10
      command:
      - sh
      env:
      image: alpine:3.7
      imagePullPolicy: IfNotPresent
      name: zoo-generic
    inputs:
      parameters:
      - name: excluded_node
      - name: log_postfix
    name: runzoo
    outputs:
      artifacts:
      - archive:
          none: {}
        name: zoo
        optional: true
        path: foobar
    podSpecPatch: '{"enableServiceLinks": false}'
    retryStrategy:
      affinity:
        nodeAntiAffinity: {}
      limit: 1
      retryPolicy: OnError
    serviceAccountName: zoo
    timeout: 65m

Logs from the workflow controller

# disclaimer: only loglines related to this workflow: see below for full logs:
time="2024-09-18T12:43:22.622Z" level=info msg="Steps node foobar initialized Running" namespace=zoo workflow=foobar
time="2024-09-18T12:43:22.622Z" level=info msg="StepGroup node foobar-2315961973 initialized Running" namespace=zoo workflow=foobar
time="2024-09-18T12:43:22.622Z" level=info msg="Skipping foobar[0].getBatchlist: when 'false == true' evaluated false" namespace=zoo workflow=foobar
time="2024-09-18T12:43:22.622Z" level=info msg="Skipped node foobar-3698260353 initialized Skipped (message: when 'false == true' evaluated false)" namespace=zoo workflow=foobar
time="2024-09-18T12:43:22.622Z" level=info msg="Step group node foobar-2315961973 successful" namespace=zoo workflow=foobar
time="2024-09-18T12:43:22.622Z" level=info msg="node foobar-2315961973 phase Running -> Succeeded" namespace=zoo workflow=foobar
time="2024-09-18T12:43:22.622Z" level=info msg="node foobar-2315961973 finished: 2024-09-18 12:43:22.622966509 +0000 UTC" namespace=zoo workflow=foobar
time="2024-09-18T12:43:22.623Z" level=info msg="StepGroup node foobar-2248998592 initialized Running" namespace=zoo workflow=foobar
time="2024-09-18T12:43:22.623Z" level=info msg="SG Outbound nodes of foobar-3698260353 are [foobar-3698260353]" namespace=zoo workflow=foobar
time="2024-09-18T12:43:22.623Z" level=info msg="Retry node foobar-3432390641 initialized Running" namespace=zoo workflow=foobar
time="2024-09-18T12:43:22.624Z" level=info msg="Pod node foobar-1516080840 initialized Pending" namespace=zoo workflow=foobar
time="2024-09-18T12:43:22.688Z" level=info msg="Created pod: foobar[1].runTestbatch(0) (foobar-1516080840)" namespace=zoo workflow=foobar
time="2024-09-18T12:43:22.688Z" level=info msg="Workflow step group node foobar-2248998592 not yet completed" namespace=zoo workflow=foobar
time="2024-09-18T12:43:32.693Z" level=info msg="node changed" namespace=zoo new.message= new.phase=Running new.progress=0/1 nodeID=foobar-1516080840 old.message= old.phase=Pending old.progress=0/1 workflow=foobar
time="2024-09-18T12:43:32.693Z" level=info msg="SG Outbound nodes of foobar-3698260353 are [foobar-3698260353]" namespace=zoo workflow=foobar
time="2024-09-18T12:43:32.694Z" level=info msg="Workflow step group node foobar-2248998592 not yet completed" namespace=zoo workflow=foobar
time="2024-09-18T12:43:42.711Z" level=info msg="node unchanged" namespace=zoo nodeID=foobar-1516080840 workflow=foobar
time="2024-09-18T12:43:42.711Z" level=info msg="SG Outbound nodes of foobar-3698260353 are [foobar-3698260353]" namespace=zoo workflow=foobar
time="2024-09-18T12:43:42.713Z" level=info msg="Workflow step group node foobar-2248998592 not yet completed" namespace=zoo workflow=foobar
time="2024-09-18T13:03:42.712Z" level=info msg="node unchanged" namespace=zoo nodeID=foobar-1516080840 workflow=foobar
time="2024-09-18T13:03:42.713Z" level=info msg="SG Outbound nodes of foobar-3698260353 are [foobar-3698260353]" namespace=zoo workflow=foobar
time="2024-09-18T13:03:42.714Z" level=info msg="Workflow step group node foobar-2248998592 not yet completed" namespace=zoo workflow=foobar
time="2024-09-18T13:08:25.406Z" level=info msg="node changed" namespace=zoo new.message= new.phase=Running new.progress=0/1 nodeID=foobar-1516080840 old.message= old.phase=Running old.progress=0/1 workflow=foobar
time="2024-09-18T13:08:25.407Z" level=info msg="SG Outbound nodes of foobar-3698260353 are [foobar-3698260353]" namespace=zoo workflow=foobar
time="2024-09-18T13:08:25.408Z" level=info msg="Workflow step group node foobar-2248998592 not yet completed" namespace=zoo workflow=foobar
time="2024-09-18T13:08:25.412Z" level=info msg="cleaning up pod" action=terminateContainers key=zoo/foobar-1516080840/terminateContainers
time="2024-09-18T13:08:25.412Z" level=info msg="https://172.18.64.1:443/api/v1/namespaces/zoo/pods/foobar-1516080840/exec?command=%2Fvar%2Frun%2Fargo%2Fargoexec&command=kill&command=15&command=1&container=wait&stderr=true&stdout=true&tty=false"
time="2024-09-18T13:08:25.437Z" level=info msg="signaled container" container=wait error="unable to upgrade connection: container not found (\"wait\")" namespace=zoo pod=foobar-1516080840 stderr="<nil>" stdout="<nil>"
time="2024-09-18T13:08:35.426Z" level=info msg="node changed" namespace=zoo new.message= new.phase=Succeeded new.progress=0/1 nodeID=foobar-1516080840 old.message= old.phase=Running old.progress=0/1 workflow=foobar
time="2024-09-18T13:08:45.429Z" level=info msg="node changed" namespace=zoo new.message= new.phase=Succeeded new.progress=0/1 nodeID=foobar-1516080840 old.message= old.phase=Running old.progress=0/1 workflow=foobar
time="2024-09-18T13:08:55.431Z" level=info msg="node changed" namespace=zoo new.message= new.phase=Succeeded new.progress=0/1 nodeID=foobar-1516080840 old.message= old.phase=Running old.progress=0/1 workflow=foobar
time="2024-09-18T13:08:55.438Z" level=info msg="cleaning up pod" action=killContainers key=zoo/foobar-1516080840/killContainers
time="2024-09-18T13:09:05.433Z" level=info msg="node changed" namespace=zoo new.message= new.phase=Succeeded new.progress=0/1 nodeID=foobar-1516080840 old.message= old.phase=Running old.progress=0/1 workflow=foobar
time="2024-09-18T13:09:15.437Z" level=info msg="node changed" namespace=zoo new.message= new.phase=Succeeded new.progress=0/1 nodeID=foobar-1516080840 old.message= old.phase=Running old.progress=0/1 workflow=foobar
time="2024-09-18T13:09:25.439Z" level=info msg="node changed" namespace=zoo new.message= new.phase=Succeeded new.progress=0/1 nodeID=foobar-1516080840 old.message= old.phase=Running old.progress=0/1 workflow=foobar
time="2024-09-18T13:09:35.442Z" level=info msg="node changed" namespace=zoo new.message= new.phase=Succeeded new.progress=0/1 nodeID=foobar-1516080840 old.message= old.phase=Running old.progress=0/1 workflow=foobar
time="2024-09-18T13:09:45.444Z" level=info msg="node changed" namespace=zoo new.message= new.phase=Succeeded new.progress=0/1 nodeID=foobar-1516080840 old.message= old.phase=Running old.progress=0/1 workflow=foobar
time="2024-09-18T13:09:55.447Z" level=info msg="node changed" namespace=zoo new.message= new.phase=Succeeded new.progress=0/1 nodeID=foobar-1516080840 old.message= old.phase=Running old.progress=0/1 workflow=foobar
time="2024-09-18T13:10:05.449Z" level=info msg="node changed" namespace=zoo new.message= new.phase=Succeeded new.progress=0/1 nodeID=foobar-1516080840 old.message= old.phase=Running old.progress=0/1 workflow=foobar
time="2024-09-18T13:10:15.451Z" level=info msg="node changed" namespace=zoo new.message= new.phase=Succeeded new.progress=0/1 nodeID=foobar-1516080840 old.message= old.phase=Running old.progress=0/1 workflow=foobar
time="2024-09-18T13:10:25.453Z" level=info msg="node changed" namespace=zoo new.message= new.phase=Succeeded new.progress=0/1 nodeID=foobar-1516080840 old.message= old.phase=Running old.progress=0/1 workflow=foobar
time="2024-09-18T13:10:35.455Z" level=info msg="node changed" namespace=zoo new.message= new.phase=Succeeded new.progress=0/1 nodeID=foobar-1516080840 old.message= old.phase=Running old.progress=0/1 workflow=foobar
time="2024-09-18T13:10:45.458Z" level=info msg="node changed" namespace=zoo new.message= new.phase=Succeeded new.progress=0/1 nodeID=foobar-1516080840 old.message= old.phase=Running old.progress=0/1 workflow=foobar

#full logs around that 'error':
time="2024-09-18T13:08:25.408Z" level=info msg=reconcileAgentPod namespace=zoo workflow=foobar
time="2024-09-18T13:08:25.408Z" level=info msg="Workflow to be dehydrated" Workflow Size=6846
time="2024-09-18T13:08:25.412Z" level=info msg="cleaning up pod" action=terminateContainers key=zoo/foobar-1516080840/terminateContainers
time="2024-09-18T13:08:25.412Z" level=info msg="https://172.18.64.1:443/api/v1/namespaces/zoo/pods/foobar-1516080840/exec?command=%2Fvar%2Frun%2Fargo%2Fargoexec&command=kill&command=15&command=1&container=wait&stderr=true&stdout=true&tty=false"
time="2024-09-18T13:08:25.429Z" level=info msg="Workflow update successful" namespace=zoo phase=Running resourceVersion=1608334928 workflow=foobar
time="2024-09-18T13:08:25.437Z" level=info msg="signaled container" container=wait error="unable to upgrade connection: container not found (\"wait\")" namespace=zoo pod=foobar-1516080840 stderr="<nil>" stdout="<nil>"
time="2024-09-18T13:08:26.450Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=1608281724 namespace=zoo workflow=53258
time="2024-09-18T13:08:26.450Z" level=info msg="Task-result reconciliation" namespace=zoo numObjs=0 workflow=53258
time="2024-09-18T13:08:26.450Z" level=warning msg="workflow uses legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" namespace=zoo workflow=53258
time="2024-09-18T13:08:26.450Z" level=info msg="node unchanged" namespace=zoo nodeID=53258 workflow=53258
time="2024-09-18T13:08:26.450Z" level=info msg="SG Outbound nodes of 53258 are [53258]" namespace=zoo workflow=53258
time="2024-09-18T13:08:26.452Z" level=info msg="Workflow step group node 53258 not yet completed" namespace=zoo workflow=53258
time="2024-09-18T13:08:26.452Z" level=info msg="TaskSet Reconciliation" namespace=zoo workflow=53258
time="2024-09-18T13:08:26.452Z" level=info msg=reconcileAgentPod namespace=zoo workflow=53258
time="2024-09-18T13:08:29.265Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=1608334734 namespace=zoo workflow=53260
time="2024-09-18T13:08:29.266Z" level=info msg="Task-result reconciliation" namespace=zoo numObjs=0 workflow=53260
time="2024-09-18T13:08:29.266Z" level=warning msg="workflow uses legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" namespace=zoo workflow=53260
time="2024-09-18T13:08:29.266Z" level=info msg="node unchanged" namespace=zoo nodeID=53260 workflow=53260
time="2024-09-18T13:08:29.266Z" level=info msg="SG Outbound nodes of 53260 are [53260]" namespace=zoo workflow=53260
time="2024-09-18T13:08:29.267Z" level=info msg="Workflow step group node 53260 not yet completed" namespace=zoo workflow=53260
time="2024-09-18T13:08:29.268Z" level=info msg="TaskSet Reconciliation" namespace=zoo workflow=53260
time="2024-09-18T13:08:29.268Z" level=info msg=reconcileAgentPod namespace=zoo workflow=53260
time="2024-09-18T13:08:29.524Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=1608334741 namespace=zoo workflow=53258
time="2024-09-18T13:08:29.524Z" level=info msg="Task-result reconciliation" namespace=zoo numObjs=0 workflow=53258
time="2024-09-18T13:08:29.524Z" level=warning msg="workflow uses legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" namespace=zoo workflow=53258
time="2024-09-18T13:08:29.524Z" level=warning msg="workflow uses legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" namespace=zoo workflow=53258
time="2024-09-18T13:08:29.524Z" level=info msg="node changed" namespace=zoo new.message= new.phase=Succeeded new.progress=0/1 nodeID=53258 old.message= old.phase=Running old.progress=0/1 workflow=53258
time="2024-09-18T13:08:29.525Z" level=info msg="SG Outbound nodes of 53258 are [53258]" namespace=zoo workflow=53258
time="2024-09-18T13:08:29.525Z" level=info msg="node 53258 phase Running -> Succeeded" namespace=zoo workflow=53258
time="2024-09-18T13:08:29.525Z" level=info msg="node 53258 finished: 2024-09-18 13:08:29.525870166 +0000 UTC" namespace=zoo workflow=53258
time="2024-09-18T13:08:29.525Z" level=info msg="Step group node 53258 successful" namespace=zoo workflow=53258
time="2024-09-18T13:08:29.525Z" level=info msg="node 53258 phase Running -> Succeeded" namespace=zoo workflow=53258
time="2024-09-18T13:08:29.525Z" level=info msg="node 53258 finished: 2024-09-18 13:08:29.52593968 +0000 UTC" namespace=zoo workflow=53258
time="2024-09-18T13:08:29.525Z" level=info msg="StepGroup node 53258 initialized Running" namespace=zoo workflow=53258
time="2024-09-18T13:08:29.525Z" level=info msg="SG Outbound nodes of 53258 are [53258]" namespace=zoo workflow=53258
time="2024-09-18T13:08:29.526Z" level=info msg="Skipping 53258[2].rerunTestbatch: when '( false == false && Succeeded != Succeeded )' evaluated false" namespace=zoo workflow=53258
time="2024-09-18T13:08:29.526Z" level=info msg="Skipped node 53258 initialized Skipped (message: when '( false == false && Succeeded != Succeeded )' evaluated false)" namespace=zoo workflow=53258
time="2024-09-18T13:08:29.526Z" level=info msg="Step group node 53258 successful" namespace=zoo workflow=53258
time="2024-09-18T13:08:29.526Z" level=info msg="node 53258 phase Running -> Succeeded" namespace=zoo workflow=53258
time="2024-09-18T13:08:29.526Z" level=info msg="node 53258 finished: 2024-09-18 13:08:29.526195476 +0000 UTC" namespace=zoo workflow=53258
time="2024-09-18T13:08:29.526Z" level=info msg="Outbound nodes of 53258 is [53258]" namespace=zoo workflow=53258
time="2024-09-18T13:08:29.526Z" level=info msg="Outbound nodes of 53258 is [53258]" namespace=zoo workflow=53258
time="2024-09-18T13:08:29.526Z" level=info msg="node 53258 phase Running -> Succeeded" namespace=zoo workflow=53258
time="2024-09-18T13:08:29.526Z" level=info msg="node 53258 finished: 2024-09-18 13:08:29.526252173 +0000 UTC" namespace=zoo workflow=53258
time="2024-09-18T13:08:29.526Z" level=info msg="TaskSet Reconciliation" namespace=zoo workflow=53258
time="2024-09-18T13:08:29.526Z" level=info msg=reconcileAgentPod namespace=zoo workflow=53258
time="2024-09-18T13:08:29.526Z" level=info msg="Updated phase Running -> Succeeded" namespace=zoo workflow=53258
time="2024-09-18T13:08:29.526Z" level=info msg="Marking workflow completed" namespace=zoo workflow=53258
time="2024-09-18T13:08:29.526Z" level=info msg="Workflow to be dehydrated" Workflow Size=8284
time="2024-09-18T13:08:29.570Z" level=info msg="Workflow update successful" namespace=zoo phase=Succeeded resourceVersion=1608335091 workflow=53258
time="2024-09-18T13:08:30.041Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=1608281827 namespace=zoo workflow=53258
time="2024-09-18T13:08:30.042Z" level=info msg="Task-result reconciliation" namespace=zoo numObjs=0 workflow=53258
time="2024-09-18T13:08:30.042Z" level=warning msg="workflow uses legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" namespace=zoo workflow=53258
time="2024-09-18T13:08:30.042Z" level=info msg="node unchanged" namespace=zoo nodeID=53258 workflow=53258
time="2024-09-18T13:08:30.042Z" level=info msg="SG Outbound nodes of 53258 are [53258]" namespace=zoo workflow=53258
time="2024-09-18T13:08:30.043Z" level=info msg="Workflow step group node 53258 not yet completed" namespace=zoo workflow=53258
time="2024-09-18T13:08:30.043Z" level=info msg="TaskSet Reconciliation" namespace=zoo workflow=53258
time="2024-09-18T13:08:30.043Z" level=info msg=reconcileAgentPod namespace=zoo workflow=53258
time="2024-09-18T13:08:30.284Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=1608334764 namespace=zoo workflow=53260
time="2024-09-18T13:08:30.284Z" level=info msg="Task-result reconciliation" namespace=zoo numObjs=0 workflow=53260
time="2024-09-18T13:08:30.284Z" level=warning msg="workflow uses legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" namespace=zoo workflow=53260
time="2024-09-18T13:08:30.284Z" level=info msg="node unchanged" namespace=zoo nodeID=53260 workflow=53260
time="2024-09-18T13:08:30.285Z" level=info msg="SG Outbound nodes of 53260 are [53260]" namespace=zoo workflow=53260
time="2024-09-18T13:08:30.286Z" level=info msg="Workflow step group node 53260 not yet completed" namespace=zoo workflow=53260
time="2024-09-18T13:08:30.286Z" level=info msg="TaskSet Reconciliation" namespace=zoo workflow=53260
time="2024-09-18T13:08:30.286Z" level=info msg=reconcileAgentPod namespace=zoo workflow=53260
time="2024-09-18T13:08:33.816Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=1608334873 namespace=zoo workflow=53260
time="2024-09-18T13:08:33.816Z" level=info msg="Task-result reconciliation" namespace=zoo numObjs=0 workflow=53260
time="2024-09-18T13:08:33.817Z" level=warning msg="workflow uses legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" namespace=zoo workflow=53260
time="2024-09-18T13:08:33.817Z" level=info msg="node unchanged" namespace=zoo nodeID=53260 workflow=53260
time="2024-09-18T13:08:33.817Z" level=info msg="SG Outbound nodes of 53260 are [53260]" namespace=zoo workflow=53260
time="2024-09-18T13:08:33.818Z" level=info msg="Workflow step group node 53260 not yet completed" namespace=zoo workflow=53260
time="2024-09-18T13:08:33.818Z" level=info msg="TaskSet Reconciliation" namespace=zoo workflow=53260
time="2024-09-18T13:08:33.818Z" level=info msg=reconcileAgentPod namespace=zoo workflow=53260
time="2024-09-18T13:08:34.032Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=1608272093 namespace=zoo workflow=53258
time="2024-09-18T13:08:34.033Z" level=info msg="Task-result reconciliation" namespace=zoo numObjs=0 workflow=53258
time="2024-09-18T13:08:34.033Z" level=warning msg="workflow uses legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" namespace=zoo workflow=53258
time="2024-09-18T13:08:34.033Z" level=warning msg="workflow uses legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" namespace=zoo workflow=53258
time="2024-09-18T13:08:34.033Z" level=info msg="node changed" namespace=zoo new.message= new.phase=Running new.progress=0/1 nodeID=53258 old.message= old.phase=Running old.progress=0/1 workflow=53258
time="2024-09-18T13:08:34.033Z" level=info msg="SG Outbound nodes of 53258 are [53258]" namespace=zoo workflow=53258
time="2024-09-18T13:08:34.035Z" level=info msg="Workflow step group node 53258 not yet completed" namespace=zoo workflow=53258
time="2024-09-18T13:08:34.035Z" level=info msg="TaskSet Reconciliation" namespace=zoo workflow=53258
time="2024-09-18T13:08:34.035Z" level=info msg=reconcileAgentPod namespace=zoo workflow=53258
time="2024-09-18T13:08:34.035Z" level=info msg="Workflow to be dehydrated" Workflow Size=6842
time="2024-09-18T13:08:34.038Z" level=info msg="cleaning up pod" action=terminateContainers key=zoo/53258/terminateContainers
time="2024-09-18T13:08:34.066Z" level=info msg="Workflow update successful" namespace=zoo phase=Running resourceVersion=1608335237 workflow=53258
time="2024-09-18T13:08:34.578Z" level=info msg="cleaning up pod" action=deletePod key=zoo/53258/deletePod
time="2024-09-18T13:08:35.426Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=1608334928 namespace=zoo workflow=foobar
time="2024-09-18T13:08:35.426Z" level=info msg="Task-result reconciliation" namespace=zoo numObjs=0 workflow=foobar
time="2024-09-18T13:08:35.426Z" level=warning msg="workflow uses legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" namespace=zoo workflow=foobar
time="2024-09-18T13:08:35.426Z" level=warning msg="workflow uses legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" namespace=zoo workflow=foobar
time="2024-09-18T13:08:35.426Z" level=info msg="node changed" namespace=zoo new.message= new.phase=Succeeded new.progress=0/1 nodeID=foobar-1516080840 old.message= old.phase=Running old.progress=0/1 workflow=foobar
time="2024-09-18T13:08:35.426Z" level=info msg="pod reconciliation didn't complete, will retry" namespace=zoo workflow=foobar
time="2024-09-18T13:08:35.981Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=1608334754 namespace=zar workflow=je4jo7q54v
time="2024-09-18T13:08:35.981Z" level=info msg="Task-result reconciliation" namespace=zar numObjs=0 workflow=je4jo7q54v
time="2024-09-18T13:08:35.981Z" level=warning msg="workflow uses legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" namespace=zar workflow=je4jo7q54v
time="2024-09-18T13:08:35.982Z" level=info msg="node changed" namespace=zar new.message= new.phase=Running new.progress=0/1 nodeID=je4jo7q54v old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=je4jo7q54v
time="2024-09-18T13:08:35.982Z" level=info msg="TaskSet Reconciliation" namespace=zar workflow=je4jo7q54v
time="2024-09-18T13:08:35.982Z" level=info msg=reconcileAgentPod namespace=zar workflow=je4jo7q54v
time="2024-09-18T13:08:35.982Z" level=info msg="Workflow to be dehydrated" Workflow Size=8192
time="2024-09-18T13:08:36.018Z" level=info msg="Workflow update successful" namespace=zar phase=Running resourceVersion=1608335287 workflow=je4jo7q54v
time="2024-09-18T13:08:36.390Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=1608334878 namespace=zoo workflow=53260
time="2024-09-18T13:08:36.390Z" level=info msg="Task-result reconciliation" namespace=zoo numObjs=0 workflow=53260
time="2024-09-18T13:08:36.390Z" level=warning msg="workflow uses legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" namespace=zoo workflow=53260
time="2024-09-18T13:08:36.390Z" level=info msg="node changed" namespace=zoo new.message= new.phase=Running new.progress=0/1 nodeID=53260 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=53260
time="2024-09-18T13:08:36.391Z" level=info msg="SG Outbound nodes of 53260 are [53260]" namespace=zoo workflow=53260
time="2024-09-18T13:08:36.392Z" level=info msg="Workflow step group node 53260 not yet completed" namespace=zoo workflow=53260
time="2024-09-18T13:08:36.392Z" level=info msg="TaskSet Reconciliation" namespace=zoo workflow=53260
time="2024-09-18T13:08:36.392Z" level=info msg=reconcileAgentPod namespace=zoo workflow=53260
time="2024-09-18T13:08:36.392Z" level=info msg="Workflow to be dehydrated" Workflow Size=6717
time="2024-09-18T13:08:36.464Z" level=info msg="Workflow update successful" namespace=zoo phase=Running resourceVersion=1608335305 workflow=53260
time="2024-09-18T13:08:36.708Z" level=info msg="Processing workflow" Phase= ResourceVersion=1608335319 namespace=zoo workflow=53260
time="2024-09-18T13:08:36.720Z" level=warning msg="Non-transient error: configmaps \"artifact-repositories\" not found"
time="2024-09-18T13:08:36.720Z" level=info msg="resolved artifact repository" artifactRepositoryRef=default-artifact-repository
time="2024-09-18T13:08:36.720Z" level=info msg="Task-result reconciliation" namespace=zoo numObjs=0 workflow=53260
time="2024-09-18T13:08:36.720Z" level=info msg="Updated phase  -> Running" namespace=zoo workflow=53260
time="2024-09-18T13:08:36.720Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=zoo workflow=53260
time="2024-09-18T13:08:36.720Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=zoo workflow=53260
time="2024-09-18T13:08:36.720Z" level=info msg="Steps node 53260 initialized Running" namespace=zoo workflow=53260
time="2024-09-18T13:08:36.720Z" level=info msg="StepGroup node 53260 initialized Running" namespace=zoo workflow=53260
time="2024-09-18T13:08:36.720Z" level=info msg="Skipping 53260[0].getBatchlist: when 'false == true' evaluated false" namespace=zoo workflow=53260
time="2024-09-18T13:08:36.720Z" level=info msg="Skipped node 53260 initialized Skipped (message: when 'false == true' evaluated false)" namespace=zoo workflow=53260
time="2024-09-18T13:08:36.720Z" level=info msg="Step group node 53260 successful" namespace=zoo workflow=53260
time="2024-09-18T13:08:36.720Z" level=info msg="node 53260 phase Running -> Succeeded" namespace=zoo workflow=53260
time="2024-09-18T13:08:36.720Z" level=info msg="node 53260 finished: 2024-09-18 13:08:36.720966677 +0000 UTC" namespace=zoo workflow=53260
time="2024-09-18T13:08:36.720Z" level=info msg="StepGroup node 53260 initialized Running" namespace=zoo workflow=53260
time="2024-09-18T13:08:36.721Z" level=info msg="SG Outbound nodes of 53260 are [53260]" namespace=zoo workflow=53260
time="2024-09-18T13:08:36.721Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=zoo workflow=53260
time="2024-09-18T13:08:36.721Z" level=info msg="Retry node 53260 initialized Running" namespace=zoo workflow=53260
time="2024-09-18T13:08:36.722Z" level=info msg="Pod node 53260 initialized Pending" namespace=zoo workflow=53260
time="2024-09-18T13:08:36.751Z" level=info msg="Created pod: 53260[1].runTestbatch(0) (53260)" namespace=zoo workflow=53260
time="2024-09-18T13:08:36.751Z" level=info msg="Workflow step group node 53260 not yet completed" namespace=zoo workflow=53260
time="2024-09-18T13:08:36.751Z" level=info msg="TaskSet Reconciliation" namespace=zoo workflow=53260
time="2024-09-18T13:08:36.751Z" level=info msg=reconcileAgentPod namespace=zoo workflow=53260
time="2024-09-18T13:08:36.751Z" level=info msg="Workflow to be dehydrated" Workflow Size=6354
time="2024-09-18T13:08:36.776Z" level=info msg="Workflow update successful" namespace=zoo phase=Running resourceVersion=1608335327 workflow=53260
time="2024-09-18T13:08:37.303Z" level=info msg="Processing workflow" Phase= ResourceVersion=1608335356 namespace=zoo workflow=53260
time="2024-09-18T13:08:37.313Z" level=warning msg="Non-transient error: configmaps \"artifact-repositories\" not found"

Logs from in your workflow's wait container

time="2024-09-18T12:43:25.214Z" level=info msg="Starting Workflow Executor" version=v3.5.10
time="2024-09-18T12:43:25.219Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-09-18T12:43:25.220Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=zoo podName=foobar-1516080840 templateName=runzoo version="&Version{Version:v3.5.10,BuildDate:2024-08-01T05:11:25Z,GitCommit:25829927431d9a0f46d17b72ae74aedb8d700884,GitTag:v3.5.10,GitTreeState:clean,GoVersion:go1.21.12,Compiler:gc,Platform:linux/amd64,}"
time="2024-09-18T12:43:25.228Z" level=warning msg="failed to patch task result, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:zoo:zoo\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"zoo\""
time="2024-09-18T12:43:25.266Z" level=info msg="Starting deadline monitor"
time="2024-09-18T12:44:25.269Z" level=warning msg="failed to patch task result, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:zoo:zoo\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"zoo\""
time="2024-09-18T12:45:25.272Z" level=warning msg="failed to patch task result, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:zoo:zoo\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"zoo\""
time="2024-09-18T12:46:25.270Z" level=warning msg="failed to patch task result, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:zoo:zoo\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"zoo\""
time="2024-09-18T12:47:25.275Z" level=warning msg="failed to patch task result, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:zoo:zoo\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"zoo\""
time="2024-09-18T12:48:25.220Z" level=info msg="Alloc=7855 TotalAlloc=15597 Sys=24165 NumGC=6 Goroutines=8"
time="2024-09-18T12:48:25.269Z" level=warning msg="failed to patch task result, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:zoo:zoo\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"zoo\""
time="2024-09-18T12:49:25.269Z" level=warning msg="failed to patch task result, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:zoo:zoo\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"zoo\""
time="2024-09-18T12:50:25.269Z" level=warning msg="failed to patch task result, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:zoo:zoo\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"zoo\""
time="2024-09-18T12:51:25.268Z" level=warning msg="failed to patch task result, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:zoo:zoo\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"zoo\""
time="2024-09-18T12:52:25.269Z" level=warning msg="failed to patch task result, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:zoo:zoo\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"zoo\""
time="2024-09-18T12:53:25.221Z" level=info msg="Alloc=7918 TotalAlloc=17267 Sys=24165 NumGC=8 Goroutines=8"
time="2024-09-18T12:53:25.268Z" level=warning msg="failed to patch task result, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:zoo:zoo\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"zoo\""
time="2024-09-18T12:54:25.272Z" level=warning msg="failed to patch task result, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:zoo:zoo\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"zoo\""
time="2024-09-18T12:55:25.270Z" level=warning msg="failed to patch task result, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:zoo:zoo\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"zoo\""
time="2024-09-18T12:56:25.270Z" level=warning msg="failed to patch task result, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:zoo:zoo\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"zoo\""
time="2024-09-18T12:57:25.269Z" level=warning msg="failed to patch task result, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:zoo:zoo\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"zoo\""
time="2024-09-18T12:58:25.220Z" level=info msg="Alloc=7597 TotalAlloc=18949 Sys=24165 NumGC=11 Goroutines=8"
time="2024-09-18T12:58:25.269Z" level=warning msg="failed to patch task result, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:zoo:zoo\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"zoo\""
time="2024-09-18T12:59:25.268Z" level=warning msg="failed to patch task result, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:zoo:zoo\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"zoo\""
time="2024-09-18T13:00:25.271Z" level=warning msg="failed to patch task result, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:zoo:zoo\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"zoo\""
time="2024-09-18T13:01:25.280Z" level=warning msg="failed to patch task result, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:zoo:zoo\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"zoo\""
time="2024-09-18T13:02:25.305Z" level=warning msg="failed to patch task result, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:zoo:zoo\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"zoo\""
time="2024-09-18T13:03:25.221Z" level=info msg="Alloc=7944 TotalAlloc=20604 Sys=24165 NumGC=13 Goroutines=8"
time="2024-09-18T13:03:25.269Z" level=warning msg="failed to patch task result, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:zoo:zoo\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"zoo\""
time="2024-09-18T13:04:25.269Z" level=warning msg="failed to patch task result, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:zoo:zoo\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"zoo\""
time="2024-09-18T13:05:25.269Z" level=warning msg="failed to patch task result, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:zoo:zoo\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"zoo\""
time="2024-09-18T13:06:25.270Z" level=warning msg="failed to patch task result, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:zoo:zoo\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"zoo\""
time="2024-09-18T13:07:25.269Z" level=warning msg="failed to patch task result, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:zoo:zoo\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"zoo\""
time="2024-09-18T13:08:22.934Z" level=info msg="Main container completed" error="<nil>"
time="2024-09-18T13:08:22.934Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-09-18T13:08:22.934Z" level=info msg="No output parameters"
time="2024-09-18T13:08:22.934Z" level=info msg="Saving output artifacts"
time="2024-09-18T13:08:22.935Z" level=info msg="Staging artifact: zoo"
time="2024-09-18T13:08:22.935Z" level=info msg="Staging foobar from mirrored volume mount /mainctrfs/foobar"
time="2024-09-18T13:08:22.935Z" level=info msg="No compression strategy needed. Staging skipped"
time="2024-09-18T13:08:23.085Z" level=info msg="S3 Save path: foobar, key: foobar"
time="2024-09-18T13:08:23.086Z" level=info msg="Creating minio client using static credentials" endpoint="foobar"
time="2024-09-18T13:08:23.104Z" level=info msg="Saving file to s3" bucket=artifacts endpoint="foobar" key=foobar path=foobar
time="2024-09-18T13:08:25.018Z" level=info msg="Save artifact" artifactName=zoo duration=1.932563921s error="<nil>" key=foobar
time="2024-09-18T13:08:25.018Z" level=info msg="not deleting local artifact" localArtPath=/mainctrfs/foobar
time="2024-09-18T13:08:25.018Z" level=info msg="Successfully saved file: /mainctrfs/foobar"
time="2024-09-18T13:08:25.020Z" level=warning msg="failed to patch task result, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:zoo:zoo\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"zoo\""
time="2024-09-18T13:08:25.043Z" level=info msg="Alloc=8445 TotalAlloc=23100 Sys=24165 NumGC=16 Goroutines=10"
time="2024-09-18T13:08:25.045Z" level=warning msg="failed to patch task result, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io \"foobar-1516080840\" is forbidden: User \"system:serviceaccount:zoo:zoo\" cannot patch resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"zoo\""
time="2024-09-18T13:08:25.221Z" level=info msg="Alloc=8481 TotalAlloc=23136 Sys=24165 NumGC=16 Goroutines=11"
time="2024-09-18T13:08:25.269Z" level=warning msg="failed to patch task result, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:zoo:zoo\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"zoo\""
@heyleke
Copy link
Contributor Author

heyleke commented Sep 19, 2024

I explicitly left the warning around:
failed to patch task result, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/

in the logs, as I wonder if this could the reason we're hitting this, and so resolving this, might solve our problem, but as I'm not sure, I did wanted to file the issue if there is a race issue between controller and wait container somehow.

@heyleke
Copy link
Contributor Author

heyleke commented Sep 19, 2024

might be related to #13454

@heyleke
Copy link
Contributor Author

heyleke commented Sep 19, 2024

looking at #13496 (comment) it seems that this must be related to the missing RBAC config according to Alex Peelman, but @jswxstw seems to have some doubts...

@agilgur5 agilgur5 changed the title signaled container: wait error: unable to upgrade connection: container not found. Workflow node in state Succeeded but wf not progressing signaled container: wait error: unable to upgrade connection: container not found. node Succeeded but wf not progressing Sep 19, 2024
@jswxstw
Copy link
Member

jswxstw commented Sep 20, 2024

#13491 fixed the scenario where wait container was cleaned up due to pod eviction.

Can you check why wait container is missed? I guess it may be related to #13344.

@heyleke
Copy link
Contributor Author

heyleke commented Sep 20, 2024

last timestamp in wait container logs is:

time="2024-09-18T13:08:25.269Z"

on controller side we see:

time="2024-09-18T13:08:25.437Z" level=info msg="signaled container" container=wait error="unable to upgrade connection: container not found (\"wait\")" namespace=zoo pod=foobar-1516080840 stderr="<nil>" stdout="<nil>"
time="2024-09-18T13:08:55.438Z" level=info msg="cleaning up pod" action=killContainers key=zoo/foobar-1516080840/killContainers

On the host kubelet log, I see the container died at that moment (afaiu)

Sep 18 13:08:27 worker kubelet: I0918 13:08:27.424302    1120 kubelet.go:2140] "SyncLoop (PLEG): event for pod" pod="zoo/foobar-1516080840" event=&{ID:46cd6bb0-85b0-4f05-b186-013ee3f7f971 Type:ContainerDied Data:c5664344dfa2958e6f9b495759a564c876a30ce58a7ddcc541be11cc94fe856b}

So for some reason the container was not reachable at the moment controller wanted. But I repeat, that is why I opened the issue: if this happens, a workflow can get stuck, but it might also just be because of the missing RBAC configuration.

@jswxstw
Copy link
Member

jswxstw commented Sep 20, 2024

So for some reason the container was not reachable at the moment controller wanted.

The most common reason for workflow stuck is abnormal exit of wait container, and it is fixed by #13537, #13491.
You can see #13537 (comment) as a summary.

but it might also just be because of the missing RBAC configuration.

I don't think your issue is caused by missing RBAC configuration, since this issue should be reproducible consistently if so.
Moreover, this is an edge case of an edge case, see #13496 (comment).

on controller side we see:

time="2024-09-18T13:08:25.437Z" level=info msg="signaled container" container=wait error="unable to upgrade connection: container not found ("wait")" namespace=zoo pod=foobar-1516080840 stderr="" stdout=""
time="2024-09-18T13:08:55.438Z" level=info msg="cleaning up pod" action=killContainers key=zoo/foobar-1516080840/killContainers

This log shows wait container is missed, so I want to confirm whether the controller can observe that the wait container exits abnormally, which may cause #13491 not working.

@heyleke
Copy link
Contributor Author

heyleke commented Sep 20, 2024

unfortunately, I don't have the status of the workflow, only the contorller and wait log messages + kubelet output of host running the pod. Nor do I further info on the pod(wf node) reason of final removal. I only see the pod (and so must be the wait container) was still alive at the moment controller had that "unable to upgrade connection: container not found ", for unknown reasons atm (api communication/network glitch/...). But the fact it results in the workflow to stay in running state, is not ok imho.

@jswxstw
Copy link
Member

jswxstw commented Sep 20, 2024

I recommend upgrading to a newer version of argo that contains the fixes above, which can solve most stuck issues.
And you need to fix your RBAC configuration problems, otherwise it will not be available after v3.6.0 because of #13100.

@Joibel
Copy link
Member

Joibel commented Sep 20, 2024

You really should fix your RBAC. failed to patch task result indicates you haven't and anything since 3.5.5 really struggles to run a basic workflow without the the correct RBAC. But as @jswxstw indicates the result should be a consistent failure.

@heyleke I'm intrigued by your unable to upgrade connection messages. What kubernetes version and provider (e.g. Amazon/Azure/Rancher) are you on? I've been seeing this with 3.6.0-RC1 on vcluster, but NOT with 3.5 at all and I'm trying to narrow it down enough to raise an issue and try and fix it.

@heyleke
Copy link
Contributor Author

heyleke commented Sep 20, 2024

Hi @Joibel , this is on a private kubespray provisioned cluster with K8s v1.23.7, running on top of kernel 5.4 and CRE containerd://1.6.4

@agilgur5
Copy link

agilgur5 commented Sep 21, 2024

I've been seeing this with 3.6.0-RC1 on vcluster, but NOT with 3.5 at all and I'm trying to narrow it down enough to raise an issue and try and fix it.

The 3.6.0-rc1 issue is potentially related to #13012 as that is an executor request through SPDY. It didn't make it into 3.5.x as it requires newer k8s, which might be vaguely related to:

private kubespray provisioned cluster with K8s v1.23.7

IIRC oldest k8s that 3.5 was tested against was 1.24, but you can double-check the history on that one

Reproduction not done with 'latests' as that is not possible in production environment atm. [sic]

That's why you have staging, dev, and local 😉

@heyleke
Copy link
Contributor Author

heyleke commented Sep 24, 2024

we have staging and dev, but it is so rare issue, that it has not been seen in those environments

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants