Workflow Failed but Pod still Pending #13579
Comments
I don't think this is correct; that sounds like a Controller bug with the Workflow not being correctly tracked as completed. You are on an older version too. The Workflow should be stopped before being retried, otherwise that can cause very unpredictable race conditions. Not to mention that the retry process can delete Pods (you do still have #12734, which still needs further iteration and review), which would be even stranger. All in all, this suggestion sounds like it would dramatically increase unpredictability, which is not good. If anything, the root cause of the Controller not tracking correctly should be resolved. You're also missing a Workflow that reproduces the issue; please make sure to include a reproduction. Without one, a root cause analysis isn't debuggable.
This workflow is already in the Failed state. Once it failed, pod status changes were no longer tracked, so the nodeStatus was still Pending at the moment of failure, even though the pod has since Completed. Can we manually retry in this scenario? Otherwise I have no way of making my workflow succeed. It has 150,000 pods, and nearly 90% of the steps have already completed.
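For context, a manual retry would normally be attempted through the argo CLI; a minimal sketch, where the Workflow name and namespace are placeholders rather than the reporter's actual values:

```shell
# Retry a Workflow in the Failed/Error phase, re-running only the failed steps.
# "my-large-workflow" and "argo" are placeholder values for illustration.
argo retry my-large-workflow -n argo
```

Per this report, that path does not help here, because the node statuses were left in Pending when the Workflow was marked Failed.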
Ah I see, thanks for clarifying. So in this case you can't "stop" the Workflow either, since it is already considered "stopped"? I'm still thinking this is a Controller bug. The Workflow should still be
This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.
Indeed, this appears to be a bug. The Workflow should not be in the Failed state before all pod states have been processed.
I've rewritten the title to "Workflow Failed but Pod still Pending" and adjusted the labels to match this.
This is still missing a reproduction though.
This happened when I was testing a very large workflow. After database connections failed several times, the workflow was set directly to Failed, but many pods were still Pending and waiting to be scheduled, and they might continue to run later.
That's a description, but not a reproduction. Again, without a reproduction, no one can investigate or debug this.
This is also not following the issue template. Please follow the issue templates in full, acting as a good role model to other users and contributors.
So, what is the reason for the workflow's failure? Why was it marked as failed directly?
The cause in my case was that access to the database failed, and the task was directly set to Failed.
@shuangkun Does the problematic workflow use template
@shuangkun where are the controller logs?
This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.
We have also been experiencing a similar issue, where stopping a workflow from the UI causes the Workflow Phase to be set to
Pre-requisites
I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
What happened? What did you expect to happen?
My workflow failed unexpectedly, but I can't retry it

The actual situation is that the corresponding pod has already Completed. Because the workflow failed, the corresponding node status was never updated. I think we should support retry in this case.
Version(s)
v3.4.12
Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
a large workflow
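No runnable spec was provided; the following is only an illustrative sketch of the kind of large fan-out Workflow being described (the names, image, and fan-out count are assumptions, not the reporter's actual manifest):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: large-fanout-
spec:
  entrypoint: fanout
  templates:
    - name: fanout
      steps:
        - - name: run-many
            template: sleep
            withSequence:
              count: "1000"   # placeholder; the reported case had ~150,000 pods
    - name: sleep
      container:
        image: alpine:3.19
        command: [sh, -c, "sleep 5"]
```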
Logs from the workflow controller
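No controller logs were attached. A sketch of how they are typically collected, where the argo namespace and the ${workflow} variable are placeholders for your install:

```shell
# Grep the controller logs for entries about the affected Workflow.
kubectl logs -n argo deploy/workflow-controller | grep ${workflow}
```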
Logs from in your workflow's wait container
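No wait-container logs were attached either. A sketch of one way to gather them, again with a placeholder namespace and workflow name:

```shell
# Fetch wait-container logs from all pods labeled with the affected Workflow.
kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow}
```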