Workflow Failed but Pod still Pending #13579

Open · 4 tasks done
shuangkun opened this issue Sep 9, 2024 · 14 comments
Assignees: shuangkun
Labels: area/controller (Controller issues, panics), P3 (Low priority), type/bug

Comments

@shuangkun
Member

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

My workflow failed unexpectedly, but I can't retry it.
[screenshot]

The actual situation is that the corresponding pod has already completed, but because the workflow failed, that information was never updated in the node status. I think we should support retry in this case.

tianshuangkun@U-4YKHFNR6-2229 argo-workflows % kubectl get pod large-workflow-t696s-sleep-2611185570
NAME                                    READY   STATUS      RESTARTS   AGE
large-workflow-t696s-sleep-2611185570   0/2     Completed   0          46m

Version(s)

v3.4.12

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

a large workflow
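
No runnable reproduction was included. Purely as an illustration of the shape of workflow described (a wide fan-out of sleep steps, scaled down from the ~150,000 pods mentioned later in this thread), a hypothetical manifest might look like the sketch below; it is not a verified reproduction of the bug, and all names and values are assumptions:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: large-workflow-      # matches the pod name prefix shown above
spec:
  entrypoint: fan-out
  parallelism: 50                    # assumed cap on concurrently running pods
  templates:
    - name: fan-out
      steps:
        - - name: sleep
            template: sleep
            withSequence:
              count: "1000"          # scaled down from the very large fan-out described
    - name: sleep
      container:
        image: alpine:3.18
        command: [sleep, "60"]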

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
@shuangkun shuangkun self-assigned this Sep 9, 2024
@agilgur5 agilgur5 added the area/retry-manual (Manual workflow "Retry" Action (API/CLI/UI); see retryStrategy for template-level retries) and P3 (Low priority) labels Sep 9, 2024
@agilgur5 agilgur5 changed the title from "Workflow has pending pod should can be retry when failed" to "Workflow with pending pod should be retriable when failed" Sep 9, 2024
@agilgur5

The actual situation is that the corresponding pod has already completed, but because the workflow failed, that information was never updated in the node status. I think we should support retry in this case.

I don't think this is correct; that sounds like a Controller bug, with the Workflow not being correctly tracked as completed. You are on an older version too.

The Workflow should be stopped before being retried; otherwise that can cause very unpredictable race conditions. Not to mention that the retry process can delete Pods (you do still have #12734, which still needs further iteration and review), which would be even stranger.

All in all, this suggestion sounds like it would dramatically increase unpredictability, which is not good.

If anything, the root cause of the Controller not tracking correctly should be resolved. You're also missing a Workflow that reproduces this issue; please make sure to include a reproduction. It's not debuggable otherwise, nor is a root cause analysis possible.

@agilgur5 agilgur5 added the problem/more information needed (Not enough information has been provided to diagnose this issue) label Sep 10, 2024
@shuangkun
Member Author

This workflow is already in the Failed state. Once it fails, pod status changes are no longer tracked, so the nodeStatus was still Pending at the moment of failure, even though the pod has since Completed. Can we manually retry in this scenario? Otherwise I have no way to make my workflow succeed: it has 150,000 pods and nearly 90% of the steps have already completed.

@agilgur5

Once it fails, pod status changes are no longer tracked, so the nodeStatus was still Pending at the moment of failure, even though the pod has since Completed.

Ah I see, thanks for clarifying.

So in this case you can't "stop" the Workflow either since it is already considered "stopped"?

I'm still thinking this is a Controller bug. The Workflow should still be Running if it has Pending Pods, and it should correctly track all Pods it ran. It would also need to signal a termination to that Pod once it's up, per failFast or similar logic.

@github-actions (bot)
Contributor

This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.

@github-actions github-actions bot added the problem/stale (This has not had a response in some time) label Sep 29, 2024
@shuangkun
Member Author

Indeed, this should be a bug. It should not be in the Failed state before all pod states are processed.

@github-actions github-actions bot removed the problem/stale (This has not had a response in some time) and problem/more information needed (Not enough information has been provided to diagnose this issue) labels Oct 3, 2024
@agilgur5 agilgur5 changed the title from "Workflow with pending pod should be retriable when failed" to "Workflow Failed but Pod still Pending" Oct 10, 2024
@agilgur5 agilgur5 added the area/controller (Controller issues, panics) label and removed the area/retry-manual (Manual workflow "Retry" Action (API/CLI/UI); see retryStrategy for template-level retries) label Oct 10, 2024
@agilgur5

Indeed, this should be a bug. It should not be in the Failed state before all pod states are processed.

I've rewritten the title and adjusted the labels to match this

You're also missing a Workflow that reproduces this issue; please make sure to include a reproduction. It's not debuggable otherwise, nor is a root cause analysis possible.

This is still missing a reproduction though

@agilgur5 agilgur5 added the problem/more information needed (Not enough information has been provided to diagnose this issue) label Oct 10, 2024
@shuangkun
Member Author

This happened when I was testing a very large workflow. After repeated invalid-connection errors while accessing the database, the workflow was set directly to Failed, but many pods were still Pending and waiting to be scheduled, and they might continue to run later.

@agilgur5

agilgur5 commented Oct 14, 2024

That's a description, but not a reproduction. Again, without a reproduction, no one can investigate or debug this.

v3.4.12

This is also not :latest, nor the latest 3.4 patch release, 3.4.17.

Please follow the issue templates in full, acting as a good role model for other users and contributors.

@jswxstw
Member

jswxstw commented Oct 23, 2024

Indeed, this should be a bug. It should not be in the Failed state before all pod states are processed.

So, what is the reason for the workflow's failure? Why was it marked as failed directly?

@shuangkun
Member Author

shuangkun commented Oct 23, 2024

The cause I encountered here was that access to the database failed, and the task was directly set to Failed.

@jswxstw
Member

jswxstw commented Oct 25, 2024

@shuangkun Does the problematic workflow use template-level failFast and parallelism?
Can you check whether it's related to issue #13806?
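
For reference, here is a minimal sketch of where the failFast and parallelism fields being asked about would appear; the names and values below are illustrative assumptions, not taken from the problematic workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: failfast-example-
spec:
  entrypoint: main
  parallelism: 10              # workflow-level limit on concurrently running pods
  templates:
    - name: main
      parallelism: 5           # template-level limit
      dag:
        failFast: true         # stop scheduling new tasks once one task fails
        tasks:
          - name: a
            template: work
          - name: b
            template: work
    - name: work
      container:
        image: alpine:3.18
        command: [sh, -c, "sleep 5"]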

@tooptoop4
Contributor

@shuangkun Where are the controller logs?

@github-actions (bot)
Contributor

This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.

@github-actions github-actions bot added the problem/stale (This has not had a response in some time) label Nov 20, 2024
@jchacks

jchacks commented Dec 2, 2024

We have also been experiencing a similar issue: stopping a workflow from the UI causes the Workflow phase to be set to Failed while some of the nodes still have Pending status. The annoying part is that, when using a Semaphore to limit execution across different workflows, these Pending nodes still hold on to the lock, preventing future runs from starting.
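
For context, this is roughly the synchronization pattern being described: a sketch (all names assumed) of a workflow-level semaphore backed by a ConfigMap, the kind of lock that the Pending nodes of a Failed workflow reportedly keep holding:

apiVersion: v1
kind: ConfigMap
metadata:
  name: semaphore-config
data:
  workflow: "1"                # only one workflow may hold the lock at a time
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: semaphore-limited-
spec:
  entrypoint: main
  synchronization:
    semaphore:                 # workflow-level lock acquired before the workflow runs
      configMapKeyRef:
        name: semaphore-config
        key: workflow
  templates:
    - name: main
      container:
        image: alpine:3.18
        command: [sh, -c, "sleep 30"]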

@github-actions github-actions bot removed the problem/stale (This has not had a response in some time) and problem/more information needed (Not enough information has been provided to diagnose this issue) labels Dec 3, 2024