v3.5.5+: Workflow stuck in Running but all nodes completed -- incorrect RBAC #13496
Comments
I can't reproduce it and I don't see what's wrong with this. I think it needs more information.
I'll share my test setup if it can help. Running on minikube, with Argo installed using a small zsh script:

```zsh
#!/bin/zsh
set -euo pipefail

ARGO_NAMESPACE=argo
ARGO_VERSION=v3.5.10

echo "Install argo workflows ${ARGO_VERSION} in ${ARGO_NAMESPACE}"
kubectl create namespace ${ARGO_NAMESPACE} --dry-run=client -o yaml | kubectl apply -f -
kubectl apply -n ${ARGO_NAMESPACE} -f https://github.com/argoproj/argo-workflows/releases/download/${ARGO_VERSION}/install.yaml
```

I am also using Argo Events, but that is out of scope for this issue.
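For reference (not part of the original comment), a quick way to confirm the controller actually came up after running a script like this, assuming the stock `workflow-controller` Deployment name from `install.yaml`:

```zsh
# Wait for the controller Deployment to become available
kubectl -n argo rollout status deployment/workflow-controller --timeout=120s
# Sanity-check which controller image/version is actually running
kubectl -n argo get deployment workflow-controller \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```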
I'm running it locally with branch:

```
# argo get workflow-keeps-running
Name:                workflow-keeps-running
Namespace:           argo
# I did not set the ServiceAccount to `argo-workflows`
ServiceAccount:      unset (will run with the default ServiceAccount)
Status:              Succeeded
Conditions:
 PodRunning          False
 Completed           True
Created:             Mon Aug 26 16:18:31 +0800 (4 minutes ago)
Started:             Mon Aug 26 16:18:31 +0800 (4 minutes ago)
Finished:            Mon Aug 26 16:18:47 +0800 (3 minutes ago)
Duration:            16 seconds
Progress:            2/2
ResourcesDuration:   0s*(1 cpu),5s*(100Mi memory)

STEP                       TEMPLATE       PODNAME                                          DURATION  MESSAGE
 ✔ workflow-keeps-running  entrypoint
 ├─✔ task                  task-template  workflow-keeps-running-task-template-2223580216  4s
 ├─○ await-task            delay                                                                     when 'false' evaluated false
 ├─✔ task-finished         finishing      workflow-keeps-running-finishing-1300587817      6s
```
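As an aside (not from the thread): while reproducing, the live phase transitions are easy to follow with the argo CLI, assuming it is installed and pointed at the same cluster:

```zsh
# Stream status updates until the workflow completes -- or visibly hangs in Running
argo watch -n argo workflow-keeps-running
```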
I don't have time right now to run the devcontainer setup, so I am continuing with my minikube environment. I tried the v3.5.5 release and it looks like it just gets stuck in general. I am still using my ServiceAccount etc. Good idea to get the logs out, because I see a smoking gun in the wait container logs:
"status": {
"phase": "Running",
"startedAt": "2024-08-26T08:47:29Z",
"finishedAt": null,
"progress": "0/1",
"nodes": {
"workflow-keeps-running": {
"id": "workflow-keeps-running",
"name": "workflow-keeps-running",
"displayName": "workflow-keeps-running",
"type": "DAG",
"templateName": "entrypoint",
"templateScope": "local/workflow-keeps-running",
"phase": "Running",
"startedAt": "2024-08-26T08:47:29Z",
"finishedAt": null,
"progress": "0/1",
"children": [
"workflow-keeps-running-2223580216"
]
},
"workflow-keeps-running-2223580216": {
"id": "workflow-keeps-running-2223580216",
"name": "workflow-keeps-running.task",
"displayName": "task",
"type": "Pod",
"templateName": "task-template",
"templateScope": "local/workflow-keeps-running",
"phase": "Pending",
"boundaryID": "workflow-keeps-running",
"startedAt": "2024-08-26T08:47:29Z",
"finishedAt": null,
"progress": "0/1"
}
},
"taskResultsCompletionStatus": {
"workflow-keeps-running-2223580216": false,
# bug: task result name does not equal to node id.
"workflow-keeps-running-task-template-2223580216": true
}
} Release v3.5.5 has bug: #12733. Can you try v3.5.10 since you said your version is v3.5.10? |
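A note for readers (not from the thread): one way to pull the wait-container logs mentioned above, relying on the `workflows.argoproj.io/workflow` label that Argo puts on workflow pods:

```zsh
# Dump the wait container's logs for every pod of the workflow
kubectl -n argo get pods \
  -l workflows.argoproj.io/workflow=workflow-keeps-running -o name \
  | xargs -I{} kubectl -n argo logs {} -c wait
```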
I can't reproduce it either. I'm running on:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  labels:
    workflows.argoproj.io/completed: "true"
    workflows.argoproj.io/phase: Succeeded
  name: workflow-keeps-running
  namespace: default
  resourceVersion: "78560"
  uid: a81ec63c-6733-4a81-baa2-b64f4542bdbf
spec:
  arguments: {}
  entrypoint: entrypoint
  templates:
  - dag:
      tasks:
      - arguments: {}
        name: task
        template: task-template
      - arguments: {}
        depends: task.Succeeded
        name: await-task
        template: delay
        when: '{{=jsonpath(tasks[''task''].outputs.result,''$.value'') > 0}}'
      - arguments: {}
        depends: await-task.Succeeded
        name: task-next-iteration
        template: entrypoint
      - arguments: {}
        depends: task.Skipped
        name: task-circuit-breaker
        template: finishing
      - arguments: {}
        depends: task.Succeeded
        name: task-finished
        template: finishing
        when: '{{=jsonpath(tasks[''task''].outputs.result,''$.value'') == 0}}'
  ...
status:
  artifactGCStatus:
    notSpecified: true
  artifactRepositoryRef:
    artifactRepository: {}
    default: true
  conditions:
  - status: "False"
    type: PodRunning
  - status: "True"
    type: Completed
  finishedAt: "2024-08-26T09:06:41Z"
  nodes:
    ...
  phase: Succeeded
  progress: 2/2
  startedAt: "2024-08-26T09:05:59Z"
  taskResultsCompletionStatus:
    workflow-keeps-running-1300587817: true
    workflow-keeps-running-2223580216: true
```
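As an editor's aside: when debugging a workflow stuck like this, the `taskResultsCompletionStatus` map shown above can be read straight off the Workflow object; any entry left `false` is what keeps the workflow in Running:

```zsh
# Show which task results the controller still considers incomplete
kubectl -n argo get workflow workflow-keeps-running \
  -o jsonpath='{.status.taskResultsCompletionStatus}'
```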
I found the problem: it is related to the Role bound to the workflow's ServiceAccount. When using v3.4.x I did not have `workflowtaskresults` in it:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: operate-workflow-role
rules:
- apiGroups:
  - argoproj.io
  resources:
  - workflows
  - workflowtemplates
  - cronworkflows
  - clusterworkflowtemplates
```

As soon as I add the `workflowtaskresults` resource:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: operate-workflow-role
rules:
- apiGroups:
  - argoproj.io
  resources:
  - workflows
  - workflowtemplates
  - cronworkflows
  - clusterworkflowtemplates
  - workflowtaskresults
```

it works for v3.5.5. It also works for v3.5.10.

Sorry for the ruckus; I should have checked the logs better before reaching out.
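Not part of the original comment, but a quick check for this misconfiguration, assuming the workflow runs as the `default` ServiceAccount in the `argo` namespace (as in the `argo get` output above):

```zsh
# Should print "yes" once the Role grants workflowtaskresults
kubectl -n argo auth can-i create workflowtaskresults.argoproj.io \
  --as=system:serviceaccount:argo:default
```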
@alexpeelman I think you wrote the wrong version (v3.5.10) in your issue description. #12733 has been fixed in v3.5.10.
This is IMO "same same, but different". I was using v3.5.10 when the issue popped up. Considering this is related to an incorrect RBAC configuration, I don't know how to proceed with this to make it work for you guys. I can close and mark this as resolved, because it's really a config mistake.
Workflow will not get stuck in Running.
Retried it again on v3.5.10 with the `workflowtaskresults` permission removed: wf-get-3_5_10-stuck-running.json, plus the complete WF controller logs for both runs.
I reproduced it when the executor only has partial RBAC for `workflowtaskresults` (e.g. `create` but not `patch`).
As a result, the outputs reported by the workflowtaskresult and by the pod are inconsistent, and the status from the workflowtaskresult is taken in the end, which is wrong. Controller debug log:

@agilgur5 Do you think this is a bug in the controller? Or do we need to adapt to this mismatch scenario?
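For anyone following along (not from the thread): the name mismatch can be seen by listing the WorkflowTaskResult objects next to the workflow's node IDs; with v2 pod names the two sets of names diverge:

```zsh
# Task result object names (created by the executor)
kubectl -n argo get workflowtaskresults.argoproj.io \
  -l workflows.argoproj.io/workflow=workflow-keeps-running
# Node IDs the controller tracks completion against
kubectl -n argo get workflow workflow-keeps-running \
  -o jsonpath='{range .status.nodes.*}{.id}{"\n"}{end}'
```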
Thanks for root-causing this @jswxstw!

Well, that's very confusing. Edge case of an edge case here, so it's unsurprising that it wasn't handled. Note that the fallback code will all be removed in 3.6 as well: #13100, so this is perhaps not worth fixing, especially given the rarity of this edge case that only has partial RBAC.
Should this case be handled by #13454? Since an incomplete WorkflowTaskResult with a completed Pod is the case of #12993.
@agilgur5 I'm afraid not.
@Joibel do you think you could take a look at this case of the issue as well?
Isn't this just a case of documenting the required permissions?
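The documented minimum for the executor since v3.4 is `create`/`patch` on `workflowtaskresults`; a sketch of granting it imperatively (the role and binding names here are made up):

```zsh
# Grant the minimal executor permission on workflowtaskresults
kubectl -n argo create role executor-taskresults \
  --verb=create,patch \
  --resource=workflowtaskresults.argoproj.io
kubectl -n argo create rolebinding executor-taskresults \
  --role=executor-taskresults \
  --serviceaccount=argo:default
```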
Pre-requisites

I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.

What happened? What did you expect to happen?
I have a workflow template that recursively calls a DAG and uses some conditional logic to skip/omit certain tasks. It also relies on the built-in suspend template.

What I notice is that all nodes and pods run to completion and end up in a Succeeded, Skipped, or Omitted state, but the workflow status is still Running. I'd expect the workflow state to be Succeeded instead of Running.

I traced back all Argo Workflows releases, and this workflow works as expected in v3.5.4.
Version(s)
v3.5.10
Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
Logs from your workflow's wait container