Workflow Failed but Pod still Pending #13579

Open · 4 tasks done
shuangkun opened this issue Sep 9, 2024 · 14 comments
Assignees: shuangkun
Labels: area/controller (Controller issues, panics), P3 (Low priority), type/bug

Comments

@shuangkun
Member

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

My workflow failed unexpectedly, but I can't retry it.
[screenshot]

The actual situation is that the corresponding pod has already completed, but because the workflow failed, that information was never updated in the node status. I think we should support retry in this case.

tianshuangkun@U-4YKHFNR6-2229 argo-workflows % kubectl get pod large-workflow-t696s-sleep-2611185570
NAME                                    READY   STATUS      RESTARTS   AGE
large-workflow-t696s-sleep-2611185570   0/2     Completed   0          46m

Version(s)

v3.4.12

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

a large workflow
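
No runnable reproduction was included. Purely as an illustration of the shape of workflow described (a wide fan-out of sleep steps, scaled down from the ~150,000 pods mentioned later in this thread), a hypothetical manifest might look like the sketch below; it is not a verified reproduction of the bug, and all names and values are assumptions:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: large-workflow-      # matches the pod name prefix shown above
spec:
  entrypoint: fan-out
  parallelism: 50                    # assumed cap on concurrently running pods
  templates:
    - name: fan-out
      steps:
        - - name: sleep
            template: sleep
            withSequence:
              count: "1000"          # scaled down from the very large fan-out described
    - name: sleep
      container:
        image: alpine:3.18
        command: [sleep, "60"]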

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
@shuangkun shuangkun self-assigned this Sep 9, 2024
@agilgur5 agilgur5 added the area/retry-manual (Manual workflow "Retry" Action (API/CLI/UI); see retryStrategy for template-level retries) and P3 (Low priority) labels Sep 9, 2024
@agilgur5 agilgur5 changed the title from "Workflow has pending pod should can be retry when failed" to "Workflow with pending pod should be retriable when failed" Sep 9, 2024
@agilgur5

The actual situation is that the corresponding pod has already completed, but because the workflow failed, that information was never updated in the node status. I think we should support retry in this case.

I don't think this is correct; that sounds like a Controller bug, with the Workflow not being correctly tracked as completed. You are on an older version too.

The Workflow should be stopped before being retried; otherwise that can cause very unpredictable race conditions. Not to mention that the retry process can delete Pods (you do still have #12734, which still needs further iteration and review), which would be even stranger.

All in all, this suggestion sounds like it would dramatically increase unpredictability, which is not good.

If anything, the root cause of the Controller not tracking correctly should be resolved. You're also missing a Workflow that reproduces this issue; please make sure to include a reproduction. It's not debuggable otherwise, nor is a root cause analysis possible.

@agilgur5 agilgur5 added the problem/more information needed (Not enough information has been provided to diagnose this issue) label Sep 10, 2024
@shuangkun
Member Author

This workflow is already in the Failed state. Once it fails, pod status changes are no longer tracked, so the nodeStatus was still Pending at the moment of failure, even though the pod has since Completed. Can we manually retry in this scenario? Otherwise I have no way to make my workflow succeed: it has 150,000 pods and nearly 90% of the steps have already completed.

@agilgur5

Once it fails, pod status changes are no longer tracked, so the nodeStatus was still Pending at the moment of failure, even though the pod has since Completed.

Ah I see, thanks for clarifying.

So in this case you can't "stop" the Workflow either since it is already considered "stopped"?

I'm still thinking this is a Controller bug. The Workflow should still be Running if it has Pending Pods, and it should correctly track all Pods it ran. It would also need to signal a termination to that Pod once it's up, per failFast or similar logic.

@github-actions (bot)
Contributor

This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.

@github-actions github-actions bot added the problem/stale (This has not had a response in some time) label Sep 29, 2024
@shuangkun
Member Author

Indeed, this should be a bug. It should not be in the Failed state before all pod states are processed.

@github-actions github-actions bot removed the problem/stale (This has not had a response in some time) and problem/more information needed (Not enough information has been provided to diagnose this issue) labels Oct 3, 2024
@agilgur5 agilgur5 changed the title from "Workflow with pending pod should be retriable when failed" to "Workflow Failed but Pod still Pending" Oct 10, 2024
@agilgur5 agilgur5 added the area/controller (Controller issues, panics) label and removed the area/retry-manual (Manual workflow "Retry" Action (API/CLI/UI); see retryStrategy for template-level retries) label Oct 10, 2024
@agilgur5

Indeed, this should be a bug. It should not be in the Failed state before all pod states are processed.

I've rewritten the title and adjusted the labels to match this

You're also missing a Workflow that reproduces this issue; please make sure to include a reproduction. It's not debuggable otherwise, nor is a root cause analysis possible.

This is still missing a reproduction though

@agilgur5 agilgur5 added the problem/more information needed (Not enough information has been provided to diagnose this issue) label Oct 10, 2024
@shuangkun
Member Author

This happened when I was testing a very large workflow. After repeated invalid-connection errors while accessing the database, the workflow was set directly to Failed, but many pods were still Pending and waiting to be scheduled, and they might continue to run later.

@agilgur5

agilgur5 commented Oct 14, 2024

That's a description, but not a reproduction. Again, without a reproduction, no one can investigate or debug this.

v3.4.12

This is also not :latest, nor the latest 3.4 patch release, 3.4.17.

Please follow the issue templates in full, acting as a good role model for other users and contributors.

@jswxstw
Member

jswxstw commented Oct 23, 2024

Indeed, this should be a bug. It should not be in the Failed state before all pod states are processed.

So, what is the reason for the workflow's failure? Why was it marked as failed directly?

@shuangkun
Member Author

shuangkun commented Oct 23, 2024

The cause I encountered here was that access to the database failed, and the task was directly set to Failed.

@jswxstw
Member

jswxstw commented Oct 25, 2024

@shuangkun Does the problematic workflow use template-level failFast and parallelism?
Can you check whether it's related to issue #13806?
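
For reference, here is a minimal sketch of where the failFast and parallelism fields being asked about would appear; the names and values below are illustrative assumptions, not taken from the problematic workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: failfast-example-
spec:
  entrypoint: main
  parallelism: 10              # workflow-level limit on concurrently running pods
  templates:
    - name: main
      parallelism: 5           # template-level limit
      dag:
        failFast: true         # stop scheduling new tasks once one task fails
        tasks:
          - name: a
            template: work
          - name: b
            template: work
    - name: work
      container:
        image: alpine:3.18
        command: [sh, -c, "sleep 5"]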

@tooptoop4
Contributor

@shuangkun Where are the controller logs?

@github-actions (bot)
Contributor

This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.

@github-actions github-actions bot added the problem/stale (This has not had a response in some time) label Nov 20, 2024
@jchacks

jchacks commented Dec 2, 2024

We have also been experiencing a similar issue: stopping a workflow from the UI causes the Workflow phase to be set to Failed while some of the nodes still have Pending status. The annoying part is that, when using a Semaphore to limit execution across different workflows, these Pending nodes still hold on to the lock, preventing future runs from starting.
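
For context, this is roughly the synchronization pattern being described: a sketch (all names assumed) of a workflow-level semaphore backed by a ConfigMap, the kind of lock that the Pending nodes of a Failed workflow reportedly keep holding:

apiVersion: v1
kind: ConfigMap
metadata:
  name: semaphore-config
data:
  workflow: "1"                # only one workflow may hold the lock at a time
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: semaphore-limited-
spec:
  entrypoint: main
  synchronization:
    semaphore:                 # workflow-level lock acquired before the workflow runs
      configMapKeyRef:
        name: semaphore-config
        key: workflow
  templates:
    - name: main
      container:
        image: alpine:3.18
        command: [sh, -c, "sleep 30"]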

@github-actions github-actions bot removed the problem/stale (This has not had a response in some time) and problem/more information needed (Not enough information has been provided to diagnose this issue) labels Dec 3, 2024