
v3.5.8: workflow shutdown with strategy: Terminate, but stuck in Running #13726

Closed
3 of 4 tasks
zhucan opened this issue Oct 8, 2024 · 14 comments
Labels
area/controller Controller issues, panics problem/more information needed Not enough information has been provided to diagnose this issue. solution/outdated This is not up-to-date with the current version type/bug

Comments

@zhucan

zhucan commented Oct 8, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

[screenshot attached]

The workflow was shut down with strategy: Terminate, but its status is stuck in the Running state.

I expect the task results to be marked completed and the workflow status not to be stuck in Running.

Version(s)

v3.5.8

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

I don't remember how to reproduce it.

Logs from the workflow controller

No errors recorded.

Logs from your workflow's wait container

No errors recorded.
@zhucan zhucan added the type/bug label Oct 8, 2024
@zhucan
Author

zhucan commented Oct 8, 2024

@jswxstw A small change like this can update the status of the wf:

if label == "false" && (old.IsPodDeleted() || old.FailedOrError()) {
	if recentlyDeleted(old) {
		woc.log.WithField("nodeID", nodeID).Debug("Wait for marking task result as completed because pod is recently deleted.")
		// If the pod was deleted, it is possible that the controller never gets another informer message about it.
		// In that case, the workflow would only be requeued after the resync period (20m), meaning the
		// workflow would not update for 20m. Requeuing here prevents that from happening.
		woc.requeue()
		continue
	} else {
		woc.log.WithField("nodeID", nodeID).Info("Marking task result as completed because pod has been deleted for a while.")
		woc.wf.Status.MarkTaskResultComplete(nodeID)
	}
}

@Joibel
Member

Joibel commented Oct 8, 2024

This should be fixed in 3.5.11.

@zhucan
Author

zhucan commented Oct 8, 2024

@Joibel Could you paste the PR links?

@jswxstw
Member

jswxstw commented Oct 8, 2024

@Joibel Could you paste the PR links?

Related PR: #13491. Have you tested it with v3.5.11? @zhucan

@zhucan
Author

zhucan commented Oct 8, 2024

@jswxstw I had cherry-picked the PR onto v3.5.8; it does not fix the problem.

@zhucan
Author

zhucan commented Oct 8, 2024

The status of the pod is not pod.Status.Reason == "Evicted". @jswxstw

@jswxstw
Member

jswxstw commented Oct 8, 2024

@jswxstw I had cherry-picked the PR onto v3.5.8; it does not fix the problem.

I'll check it out later.

@jswxstw
Member

jswxstw commented Oct 8, 2024

@zhucan Please check whether you have an RBAC problem (see #13537 (comment)); if so, the controller has to rely on podReconciliation, which is unreliable, to synchronize task result status.

The root cause may be as below:

  • The node is marked as Failed directly, before the pod is terminated, when the workflow is shutting down.
  • The pod cleanup order may be terminateContainers -> labelPodCompleted -> killContainers, which breaks podReconciliation: once the pod is labeled as completed, it can no longer be observed by the controller (see the sketch below).
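
To illustrate the second point, here is a rough, self-contained sketch, not the controller's actual code; the exact label selector is an assumption based on the completed label mentioned above. An informer that filters out pods labeled completed stops delivering updates for a pod as soon as that label is applied, so a reconciliation loop driven by it never sees the pod's final state.

package main

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (path is illustrative).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// The informer only sees pods that are NOT labeled completed (assumed selector).
	// Once a cleanup step labels the pod completed, any later update, e.g. the
	// containers actually terminating, is invisible here, so a reconciliation
	// loop driven by this informer cannot observe it.
	factory := informers.NewSharedInformerFactoryWithOptions(
		client,
		20*time.Minute, // resync period, mirroring the 20m mentioned earlier in this thread
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.LabelSelector = "workflows.argoproj.io/completed!=true"
		}),
	)
	podInformer := factory.Core().V1().Pods().Informer()
	_ = podInformer // in real code: register event handlers and run factory.Start(stopCh)
}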

@agilgur5 agilgur5 added problem/more information needed Not enough information has been provided to diagnose this issue. solution/outdated This is not up-to-date with the current version area/controller Controller issues, panics labels Oct 8, 2024
@agilgur5 agilgur5 changed the title workflow shutdown with strategy: Terminate, but the status of the workflow stuck running state workflow shutdown with strategy: Terminate, but stuck in Running Oct 8, 2024
@agilgur5

agilgur5 commented Oct 8, 2024

Version(s)

v3.5.8

  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.

You checked this off, but did not test with :latest. This should be fixed in 3.5.11, as Alan said. This is not optional.

I don't remember how to reproduce it.

You also did not provide a reproduction or logs, which makes this difficult, if not impossible, to investigate.

Please fill out the issue template accurately and in full; it is there for a reason. It is not optional.

@jswxstw I had cherry-picked the PR onto v3.5.8; it does not fix the problem.

I've told you this before: that means you're running a fork, and we don't support forks (that's not possible, by definition). You can file an issue in that fork.

@agilgur5 agilgur5 closed this as completed Oct 8, 2024
@zhucan
Author

zhucan commented Oct 8, 2024

@zhucan Please check whether you have an RBAC problem (see #13537 (comment)); if so, the controller has to rely on podReconciliation, which is unreliable, to synchronize task result status.

I checked the controller logs; there are no RBAC warnings. @jswxstw

@zhucan
Author

zhucan commented Oct 8, 2024

I've told you this before: that means you're running a fork, and we don't support forks (that's not possible, by definition). You can file an issue in that fork.

We can't always upgrade to the latest version when there are bugs in the version we are running; we need to know which PR fixes it, not just upgrade whenever there is a bug, because we don't know whether the new version has other bugs. If you can't help with that, there is no need to answer the question.

@zhucan
Author

zhucan commented Oct 8, 2024

  • The node is marked as Failed directly, before the pod is terminated, when the workflow is shutting down.

func (n NodeStatus) IsPodDeleted() bool
The node is marked as Failed directly, but the error message is not "pod deleted"; it is "workflow shutdown with strategy: Terminate". The status is the same, but the error message is not. @jswxstw
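
For illustration, a minimal self-contained sketch of that mismatch (the real NodeStatus and IsPodDeleted live in the Argo codebase and may differ; the message-based check here is an assumption):

package main

import "fmt"

// Stand-in type for illustration only; not the real Argo Workflows NodeStatus.
type NodeStatus struct {
	Phase   string
	Message string
}

// Hypothetical check that keys off the node message, as assumed above.
func (n NodeStatus) IsPodDeleted() bool {
	return n.Message == "pod deleted"
}

func main() {
	// A node failed by shutdown carries the terminate message, not "pod deleted",
	// so a message-based check like this never treats it as a deleted pod.
	n := NodeStatus{Phase: "Failed", Message: "workflow shutdown with strategy: Terminate"}
	fmt.Println(n.IsPodDeleted()) // false
}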

@agilgur5

agilgur5 commented Oct 8, 2024

We can't always upgrade to the latest version

The issue template asks that you, at minimum, check whether :latest resolves your bug. If it does, your bug has already been fixed and you can search through the changelog to see what fixed it.
Filing an issue despite that would be duplicative, as it very likely is here, and invalid, for not following the issue template.

when there are bugs in the version we are running; we need to know which PR fixes it, not just upgrade whenever there is a bug, because we don't know whether the new version has other bugs

You could say this of literally any software. Virtually all software has bugs. If you were to follow this and fork every dependency of yours, you wouldn't be doing anything other than dependency management (that is a big part of software development these days, but usually not the only thing). You're using Argo as a dependency, so if you update other dependencies to fix bugs, you would do the same with Argo.

If you can't help with that, there is no need to answer the question.

That's not how OSS works -- you filed a bug report for a fork to the origin. Your bug report is therefore invalid as this is not that fork.
If you want to contribute to OSS or receive free community support, you should follow the rules and norms of OSS and that community, including following issue templates. You did not follow those.
Other communities and other repos may very well auto-close your issue with no response whatsoever for not following templates, and could even block you for repeatedly doing so.
Please do note that you are receiving free community support here, despite the fact that you repeatedly did not follow the rules.

If you want support for a fork, you can pay a vendor for that. You should not expect community support from the origin for your own fork; that is neither possible (by definition) nor sustainable.

@agilgur5 agilgur5 changed the title workflow shutdown with strategy: Terminate, but stuck in Running 3.5.8: workflow shutdown with strategy: Terminate, but stuck in Running Oct 8, 2024
@agilgur5 agilgur5 changed the title 3.5.8: workflow shutdown with strategy: Terminate, but stuck in Running v3.5.8: workflow shutdown with strategy: Terminate, but stuck in Running Oct 8, 2024
@agilgur5 agilgur5 added this to the v3.5.x patches milestone Oct 8, 2024
@jswxstw
Member

jswxstw commented Oct 9, 2024

The node is marked as Failed directly, but the error message is not "pod deleted"; it is "workflow shutdown with strategy: Terminate". The status is the same, but the error message is not. @jswxstw

@zhucan This is a fix for #12993 and #13533, where the wait container exited abnormally due to pod deletion. There are two related PRs: #13454 and #13537.
You can see #13537 (comment) for a summary.

Workflow shutdown will not cause the wait container to exit abnormally, so this issue should not exist in v3.5.8. I can't help more, since you provided very little information.
