argo wait hangs when finishedAt is not set on workflow #13550
Comments
I reproduced it, but the workflow is stuck. The root cause is as follows:
argo-workflows/workflow/controller/exec_control.go Lines 48 to 50 in ff2b2dd
argo-workflows/workflow/controller/operator.go Lines 822 to 825 in ff2b2dd
Node is marked as |
@ilia-medvedev-codefresh How could this happen? |
Yeah @jswxstw, it seems that you are right. I started investigating this issue on one of my clusters that was running 3.5.4, and this problem definitely existed there. I switched to a local env for testing my changes, but at some point I probably got mixed up with all the different versions. I now see that I was running 3.5.4 for the controller when I reproduced the RBAC issue. Nonetheless, I still believe it is worth adding this guard rail to |
In my opinion, the fundamental problem is that the workflow is stuck running. I can't think of any scenario where the workflow is
argo-workflows/workflow/controller/operator.go Line 2431 in ff2b2dd
I don't see any special logic that would cause the two to be inconsistent. |
Pre-requisites
:latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
What happened? What did you expect to happen?
When running argo wait on a workflow that was terminated or finished successfully, but did not have the finishedAt status set, argo wait <workflow> hangs without response. The expected behavior is for the command to return immediately as the workflow is in a terminal state.
Version(s)
8a67009
Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
To my understanding, a workflow can have a null finishedAt field for various reasons. I was able to reproduce it when the issue from #13496 manifested.
To reproduce, create the following RBAC first:
Then submit the following workflow:
Then terminate the workflow with argo terminate (once the sleep pod starts).
Now when argo wait is run on that workflow it will hang indefinitely.
We can see that in the status field for the workflow, the taskResultsCompletionStatus for the single task of this workflow is set to false, and finishedAt is set to null.
This is the complete workflow object with the status:
I realize that the task completion is a separate issue (mentioned above), but there is also faulty logic in wait that relies only on the finishedAt status, while there are edge cases where the workflow has a terminal phase but finishedAt is not set.
Logs from the workflow controller
Logs from your workflow's wait container