
v3.5.5+: Workflow stuck in Running but all nodes completed -- incorrect RBAC #13496

Open
3 of 4 tasks
alexpeelman opened this issue Aug 23, 2024 · 18 comments
Labels
area/controller (Controller issues, panics) · solution/duplicate (This issue or PR is a duplicate of an existing one) · type/bug · type/regression (Regression from previous behavior, a specific type of bug) · type/support (User support issue - likely not a bug)

Comments

@alexpeelman

alexpeelman commented Aug 23, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

I have a workflow template that recursively calls a DAG and uses conditional logic to skip/omit certain tasks. It also relies on the built-in suspend template.

What I notice is that all nodes and pods run to completion and end up in a Succeeded, Skipped, or Omitted state, but the workflow status is still Running:

Name:                workflow-keeps-running
Namespace:           argo-events
ServiceAccount:      argo-workflows
Status:              Running
Conditions:          
 PodRunning          False
Created:             Fri Aug 23 15:42:49 +0200 (3 minutes ago)
Started:             Fri Aug 23 15:42:49 +0200 (3 minutes ago)
Duration:            3 minutes 26 seconds
Progress:            8/8
ResourcesDuration:   0s*(1 cpu),17s*(100Mi memory)

STEP                          TEMPLATE       PODNAME                                          DURATION  MESSAGE
 ✔ workflow-keeps-running     entrypoint                                                                                              
 ├─✔ task                     task-template  workflow-keeps-running-task-template-2223580216  3s                                      
 ├─✔ await-task               delay                                                                                                   
 ├─○ task-finished            finishing                                                                 when 'false' evaluated false  
 └─✔ task-next-iteration      entrypoint                                                                                              
   ├─✔ task                   task-template  workflow-keeps-running-task-template-2686283671  3s                                      
   ├─✔ await-task             delay                                                                                                   
   ├─○ task-finished          finishing                                                                 when 'false' evaluated false  
   └─✔ task-next-iteration    entrypoint                                                                                              
     ├─✔ task                 task-template  workflow-keeps-running-task-template-2162925756  3s                                      
     ├─✔ await-task           delay                                                                                                   
     ├─○ task-finished        finishing                                                                 when 'false' evaluated false  
     └─✔ task-next-iteration  entrypoint                                                                                              
       ├─✔ task               task-template  workflow-keeps-running-task-template-3056561099  3s                                      
       ├─○ await-task         delay                                                                     when 'false' evaluated false  
       ├─✔ task-finished      finishing      workflow-keeps-running-finishing-3897901956      5s  

I'd expect the workflow state to be Succeeded instead of Running.
I traced back all Argo Workflows releases and this workflow works as expected in v3.5.4.

Version(s)

v3.5.10

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: workflow-keeps-running
spec:
  serviceAccountName: argo-workflows
  entrypoint: entrypoint

  templates:
    - name: entrypoint
      dag:
        tasks:
          - name: task
            template: task-template

          - name: await-task
            depends: "task.Succeeded"
            when: "{{=jsonpath(tasks['task'].outputs.result,'$.value') > 0}}"
            template: delay

          - name: task-next-iteration
            template: entrypoint
            depends: "await-task.Succeeded"

          - name: task-circuit-breaker
            depends: "task.Skipped"
            template: finishing

          - name: task-finished
            depends: "task.Succeeded"
            when: "{{=jsonpath(tasks['task'].outputs.result,'$.value') == 0}}"
            template: finishing

    - name: delay
      suspend:
        duration: 1s

    - name: task-template
      container:
        command: [ sh, -c ]
        image: alpine:3.7
        args:
          - |
            JSON_FMT='{"value":%s}'
            RND=$(( $RANDOM % 2 ))
            printf "$JSON_FMT" "$RND"

    - name: finishing
      container:
        image: busybox
        command: [ echo ]
        args: [ "near the finish" ]
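
The task-template script can be sanity-checked outside the cluster. A standalone sketch, assuming a shell where $RANDOM is available (e.g. bash; BusyBox sh only supports it when built with random support):

```shell
# Mimics the task-template container: emit {"value":0} or {"value":1}.
JSON_FMT='{"value":%s}'
RND=$(( RANDOM % 2 ))
printf "$JSON_FMT" "$RND"
```

When value is greater than 0, the await-task/task-next-iteration branch fires and the DAG recurses; when it is 0, task-finished runs and the recursion ends.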

Logs from the workflow controller

None

Logs from in your workflow's wait container

None
@agilgur5 agilgur5 changed the title Workflow stuck in Running although nodes run to completion Workflow stuck in Running but all nodes completed Aug 24, 2024
@agilgur5 agilgur5 added the area/controller Controller issues, panics label Aug 24, 2024
@agilgur5 agilgur5 changed the title Workflow stuck in Running but all nodes completed v3.5.5+: Workflow stuck in Running but all nodes completed Aug 24, 2024
@agilgur5 agilgur5 added the type/regression Regression from previous behavior (a specific type of bug) label Aug 24, 2024
@agilgur5

agilgur5 commented Aug 24, 2024

This sounds like a duplicate of #12103, although this is concretely a v3.5.5 regression whereas that one happened in v3.4. cc @jswxstw

@agilgur5 agilgur5 added this to the v3.5.x patches milestone Aug 24, 2024
@agilgur5 agilgur5 added the solution/duplicate This issue or PR is a duplicate of an existing one label Aug 24, 2024
@jswxstw
Member

jswxstw commented Aug 24, 2024

I can't reproduce it and I don't see what's wrong with this. I think it needs more information.

@alexpeelman
Author

alexpeelman commented Aug 26, 2024

  1. If you say you can't reproduce:
  • How did you run it?
  • Against which version did you test, so I can retry that one?
  2. What kind of extra information are you looking for?

I'll share my test setup in case it helps.

Running on minikube

minikube version: v1.33.1
commit: 5883c09216182566a63dff4c326a6fc9ed2982ff

Argo installed on minikube using a small ZSH script

#!/bin/zsh
set -euo pipefail

ARGO_NAMESPACE=argo
ARGO_VERSION=v3.5.10

echo "Install argo workflows ${ARGO_VERSION} in ${ARGO_NAMESPACE}"
kubectl create namespace ${ARGO_NAMESPACE} --dry-run=client -o yaml | kubectl apply -f -
kubectl apply -n ${ARGO_NAMESPACE} -f https://github.com/argoproj/argo-workflows/releases/download/${ARGO_VERSION}/install.yaml

I am also using Argo events but this is out of scope for the issue.

@jswxstw
Member

jswxstw commented Aug 26, 2024

  1. If you say you can't reproduce:
  • How did you run it?
  • Against which version did you test, so I can retry that one?

I'm running it locally with branch main and release-3.5.

# argo get workflow-keeps-running
Name:                workflow-keeps-running
Namespace:           argo
# I did not set the ServiceAccount to `argo-workflows`
ServiceAccount:      unset (will run with the default ServiceAccount)
Status:              Succeeded
Conditions:          
 PodRunning          False
 Completed           True
Created:             Mon Aug 26 16:18:31 +0800 (4 minutes ago)
Started:             Mon Aug 26 16:18:31 +0800 (4 minutes ago)
Finished:            Mon Aug 26 16:18:47 +0800 (3 minutes ago)
Duration:            16 seconds
Progress:            2/2
ResourcesDuration:   0s*(1 cpu),5s*(100Mi memory)

STEP                       TEMPLATE       PODNAME                                          DURATION  MESSAGE
 ✔ workflow-keeps-running  entrypoint                                                                                              
 ├─✔ task                  task-template  workflow-keeps-running-task-template-2223580216  4s                                      
 ├─○ await-task            delay                                                                     when 'false' evaluated false  
 ├─✔ task-finished         finishing      workflow-keeps-running-finishing-1300587817      6s
  2. What kind of extra information are you looking for?
  • Logs from the workflow controller
  • Logs from your workflow's wait container
  • Detailed status of the workflow workflow-keeps-running

@alexpeelman
Author

alexpeelman commented Aug 26, 2024

I don't have time right now to run the devcontainer setup, so I am continuing with my minikube environment.

I tried the v3.5.5 release and it looks like it is just stuck in general. I am still using my ServiceAccount etc.

Good idea to get the logs out, because I see a smoking gun in the wait container logs:
time="2024-08-26T08:47:32.753Z" level=warning msg="failed to patch task set, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io \"workflow-keeps-running-2223580216\" is forbidden: User \"system:serviceaccount:argo-events:argo-workflows\" cannot patch resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"argo-events\""
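
One way to confirm whether this is an RBAC gap (a sketch; the namespace and ServiceAccount names are taken from the error message above) is to ask the API server directly:

```shell
# Should both print "yes" when the executor RBAC is complete.
kubectl auth can-i create workflowtaskresults.argoproj.io \
  --as=system:serviceaccount:argo-events:argo-workflows -n argo-events
kubectl auth can-i patch workflowtaskresults.argoproj.io \
  --as=system:serviceaccount:argo-events:argo-workflows -n argo-events
```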

wf-get-3_5_5.json

wf-logs-3_5_5.txt

wf-logs-wait-container-3_5_5.txt

wf-controller-logs-3_5_5.txt

@jswxstw
Member

jswxstw commented Aug 26, 2024

"status": {
        "phase": "Running",
        "startedAt": "2024-08-26T08:47:29Z",
        "finishedAt": null,
        "progress": "0/1",
        "nodes": {
            "workflow-keeps-running": {
                "id": "workflow-keeps-running",
                "name": "workflow-keeps-running",
                "displayName": "workflow-keeps-running",
                "type": "DAG",
                "templateName": "entrypoint",
                "templateScope": "local/workflow-keeps-running",
                "phase": "Running",
                "startedAt": "2024-08-26T08:47:29Z",
                "finishedAt": null,
                "progress": "0/1",
                "children": [
                    "workflow-keeps-running-2223580216"
                ]
            },
            "workflow-keeps-running-2223580216": {
                "id": "workflow-keeps-running-2223580216",
                "name": "workflow-keeps-running.task",
                "displayName": "task",
                "type": "Pod",
                "templateName": "task-template",
                "templateScope": "local/workflow-keeps-running",
                "phase": "Pending",
                "boundaryID": "workflow-keeps-running",
                "startedAt": "2024-08-26T08:47:29Z",
                "finishedAt": null,
                "progress": "0/1"
            }
        },
        "taskResultsCompletionStatus": {
            "workflow-keeps-running-2223580216": false,
            # bug: task result name does not equal the node id.
            "workflow-keeps-running-task-template-2223580216": true
        }
    }

Release v3.5.5 has a known bug: #12733. Can you try v3.5.10, since you said your version is v3.5.10?

@chengjoey
Contributor

chengjoey commented Aug 26, 2024

I can't reproduce it either. I'm running v3.5.10. This is the result of my run; I tried about 3 times and each time it was Succeeded.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  labels:
    workflows.argoproj.io/completed: "true"
    workflows.argoproj.io/phase: Succeeded
  name: workflow-keeps-running
  namespace: default
  resourceVersion: "78560"
  uid: a81ec63c-6733-4a81-baa2-b64f4542bdbf
spec:
  arguments: {}
  entrypoint: entrypoint
  templates:
  - dag:
      tasks:
      - arguments: {}
        name: task
        template: task-template
      - arguments: {}
        depends: task.Succeeded
        name: await-task
        template: delay
        when: '{{=jsonpath(tasks[''task''].outputs.result,''$.value'') > 0}}'
      - arguments: {}
        depends: await-task.Succeeded
        name: task-next-iteration
        template: entrypoint
      - arguments: {}
        depends: task.Skipped
        name: task-circuit-breaker
        template: finishing
      - arguments: {}
        depends: task.Succeeded
        name: task-finished
        template: finishing
        when: '{{=jsonpath(tasks[''task''].outputs.result,''$.value'') == 0}}'
    ...
status:
  artifactGCStatus:
    notSpecified: true
  artifactRepositoryRef:
    artifactRepository: {}
    default: true
  conditions:
  - status: "False"
    type: PodRunning
  - status: "True"
    type: Completed
  finishedAt: "2024-08-26T09:06:41Z"
  nodes:
    ...
  phase: Succeeded
  progress: 2/2
  startedAt: "2024-08-26T09:05:59Z"
  taskResultsCompletionStatus:
    workflow-keeps-running-1300587817: true
    workflow-keeps-running-2223580216: true

@alexpeelman
Author

alexpeelman commented Aug 26, 2024

I found the problem: it is related to the Role configuration in my k8s setup. The logs I attached from the wait container (https://github.com/user-attachments/files/16746500/wf-logs-wait-container-3_5_5.txt) gave it away :).

When using v3.4.x I did not have workflowtaskresults configured in the Role's resources and everything worked:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: operate-workflow-role
rules:
  - apiGroups:
      - argoproj.io
    resources:
      - workflows
      - workflowtemplates
      - cronworkflows
      - clusterworkflowtemplates

As soon as I add the workflowtaskresults resource and switch to v3.5.5, everything runs to completion. Somehow I missed this requirement.

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: operate-workflow-role
rules:
  - apiGroups:
      - argoproj.io
    resources:
      - workflows
      - workflowtemplates
      - cronworkflows
      - clusterworkflowtemplates
      - workflowtaskresults
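
The snippets above are shown without their verbs list; a minimal sketch of the extra rule the executor needs for WorkflowTaskResults (per the workflow-rbac page linked in the warning message) would be something like:

```yaml
# Sketch: the executor needs at least create and patch on
# workflowtaskresults in the namespace where the pods run.
- apiGroups:
    - argoproj.io
  resources:
    - workflowtaskresults
  verbs:
    - create
    - patch
```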

It works for v3.5.5

Name:                workflow-keeps-running
Namespace:           argo-events
ServiceAccount:      argo-workflows
Status:              Succeeded
Conditions:          
 PodRunning          False
 Completed           True
Created:             Mon Aug 26 11:03:03 +0200 (41 seconds ago)
Started:             Mon Aug 26 11:03:03 +0200 (41 seconds ago)
Finished:            Mon Aug 26 11:03:44 +0200 (now)
Duration:            41 seconds
Progress:            4/4
ResourcesDuration:   0s*(1 cpu),16s*(100Mi memory)


STEP                       TEMPLATE       PODNAME                                          DURATION  MESSAGE
 ✔ workflow-keeps-running  entrypoint                                                                                              
 ├─✔ task                  task-template  workflow-keeps-running-task-template-2223580216  14s                                     
 ├─✔ await-task            delay                                                                                                   
 ├─○ task-finished         finishing                                                                 when 'false' evaluated false  
 └─✔ task-next-iteration   entrypoint                                                                                              
   ├─✔ task                task-template  workflow-keeps-running-task-template-2686283671  3s                                      
   ├─○ await-task          delay                                                                     when 'false' evaluated false  
   ├─✔ task-finished       finishing      workflow-keeps-running-finishing-2920786288      6s     

It also works for v3.5.10

Name:                workflow-keeps-running
Namespace:           argo-events
ServiceAccount:      argo-workflows
Status:              Succeeded
Conditions:          
 PodRunning          False
 Completed           True
Created:             Mon Aug 26 12:27:21 +0200 (1 minute ago)
Started:             Mon Aug 26 12:27:21 +0200 (1 minute ago)
Finished:            Mon Aug 26 12:28:35 +0200 (28 seconds ago)
Duration:            1 minute 14 seconds
Progress:            10/10
ResourcesDuration:   26s*(100Mi memory),0s*(1 cpu)

STEP                            TEMPLATE       PODNAME                                          DURATION  MESSAGE
 ✔ workflow-keeps-running       entrypoint                                                                                              
 ├─✔ task                       task-template  workflow-keeps-running-task-template-2223580216  13s                                     
 ├─✔ await-task                 delay                                                                                                   
 ├─○ task-finished              finishing                                                                 when 'false' evaluated false  
 └─✔ task-next-iteration        entrypoint                                                                                              
   ├─✔ task                     task-template  workflow-keeps-running-task-template-2686283671  4s                                      
   ├─✔ await-task               delay                                                                                                   
   ├─○ task-finished            finishing                                                                 when 'false' evaluated false  
   └─✔ task-next-iteration      entrypoint                                                                                              
     ├─✔ task                   task-template  workflow-keeps-running-task-template-2162925756  3s                                      
     ├─✔ await-task             delay                                                                                                   
     ├─○ task-finished          finishing                                                                 when 'false' evaluated false  
     └─✔ task-next-iteration    entrypoint                                                                                              
       ├─✔ task                 task-template  workflow-keeps-running-task-template-3056561099  3s                                      
       ├─✔ await-task           delay                                                                                                   
       ├─○ task-finished        finishing                                                                 when 'false' evaluated false  
       └─✔ task-next-iteration  entrypoint                                                                                              
         ├─✔ task               task-template  workflow-keeps-running-task-template-4192763504  4s                                      
         ├─○ await-task         delay                                                                     when 'false' evaluated false  
         ├─✔ task-finished      finishing      workflow-keeps-running-finishing-2721574369      6s    

Sorry for the ruckus, I should have checked the logs better before reaching out.

@jswxstw
Member

jswxstw commented Aug 26, 2024

Release v3.5.5 has bug: #12733. Can you try v3.5.10 since you said your version is v3.5.10?

@alexpeelman I think you wrote the wrong version (v3.5.10) in your issue description; #12733 has been fixed in v3.5.10.

@alexpeelman
Author

alexpeelman commented Aug 26, 2024

This is IMO "same same, but different". I was using v3.5.10 when the issue popped up. Considering this is related to an incorrect Role configuration on my side, it mimics the same behaviour as the issue you are referencing. So independent of the fix, if I don't include workflowtaskresults in the k8s Role used by the service account, the patch operation fails and hence the workflow is stuck.

... cannot patch resource "workflowtaskresults" in API group "argoproj.io"

I don't know how you want to proceed with this. I can close and mark this as resolved, because it's really a config mistake.

@jswxstw
Member

jswxstw commented Aug 26, 2024

A workflow should not get stuck in Running even if there are RBAC problems; if it does, that is a bug (like #12733).
Have you tested your workflow on v3.5.10 without workflowtaskresults access permissions? We haven't reproduced it on v3.5.10 (I removed the executor's workflowtaskresults access permissions and still could not reproduce it).

@alexpeelman
Author

Retried it again on v3.5.10; with workflowtaskresults set, it works:
wf-get-3_5_10-success.json
wf-logs-wait-3_5_10-success.txt

With workflowtaskresults removed, it is stuck and the workflow stays in the Running state.
Do mind that the workflow controller runs in a different namespace (argo) than the runtime for the workflow and pods (argo-events).

wf-get-3_5_10-stuck-running.json
wf-logs-wait-3_5_10-running.txt

The complete WF controller logs for both runs
wf-controller-full-logs-3_5_10.txt

@jswxstw
Member

jswxstw commented Aug 26, 2024

With workflowtaskresults removed, it is stuck and the workflow stays in the Running state.

I reproduced it when executor only has workflowtaskresults create permission but does not have patch permission.
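
The partial-RBAC setup that reproduces it looks roughly like this (a sketch, not the exact Role used):

```yaml
# Role rule granting create but not patch on workflowtaskresults,
# which reproduces the stuck-in-Running behaviour described above.
- apiGroups:
    - argoproj.io
  resources:
    - workflowtaskresults
  verbs:
    - create
```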

@agilgur5 agilgur5 changed the title v3.5.5+: Workflow stuck in Running but all nodes completed v3.5.5: Workflow stuck in Running but all nodes completed Aug 26, 2024
@agilgur5 agilgur5 added the type/support User support issue - likely not a bug label Aug 26, 2024
@agilgur5 agilgur5 changed the title v3.5.5: Workflow stuck in Running but all nodes completed v3.5.5+: Workflow stuck in Running but all nodes completed -- incorrect RBAC Aug 26, 2024
@jswxstw
Member

jswxstw commented Aug 27, 2024

I reproduced it when executor only has workflowtaskresults create permission but does not have patch permission.

  • create workflowtaskresults with workflows.argoproj.io/report-outputs-completed: "false" succeeded.
  • patch workflowtaskresults with workflows.argoproj.io/report-outputs-completed: "true" failed.
  • patch pod with workflows.argoproj.io/report-outputs-completed: "true" succeeded.

As a result, the outputs reported by the workflowtaskresult and the pod are inconsistent, and the status in the workflowtaskresult is what is finally taken, which is wrong.

Controller debug log:

taskresults of workflow are incomplete or still have daemon nodes, so can't mark workflow completed

@agilgur5 Do you think this is a bug in the controller? Or do we need to adapt to this mismatch scenario?

@agilgur5

agilgur5 commented Aug 27, 2024

Thanks for root causing this @jswxstw!

As a result, the outputs reported by workflowtaskresult and pod are inconsistent, and the status in workflowtaskresult is finally taken, which is wrong.

Well, that's very confusing. This is an edge case of an edge case, so it's unsurprising that it wasn't handled.
Technically the Pod should take priority, since it's a fallback.

Note that the fallback code will all be removed in 3.6 anyway (#13100), so this is perhaps not worth fixing, especially given the rarity of this edge case, which only occurs with partial RBAC.

Controller debug log:

Should this case be handled by #13454? An incomplete WorkflowTaskResult with a completed Pod is the scenario of #12993.

@jswxstw
Member

jswxstw commented Aug 28, 2024

Should this case be handled by #13454? Since incomplete WorkflowTaskResult but completed Pod is the case of #12993

@agilgur5 I'm afraid not.
#13454 marks the node as failed after a timeout and marks the workflowtaskresult as completed only when the pod is absent and the node has not completed:

if !foundPod && !node.Completed() {

@agilgur5

@Joibel do you think you could take a look at this case of the issue as well?

@tooptoop4
Contributor

Isn't this a case of just documenting the required permissions?
