
v3.5.8: workflow shutdown with strategy: Terminate, but stuck in Running #13726

Closed
3 of 4 tasks
zhucan opened this issue Oct 8, 2024 · 14 comments
Labels
area/controller Controller issues, panics problem/more information needed Not enough information has been provided to diagnose this issue. solution/outdated This is not up-to-date with the current version type/bug

Comments

@zhucan

zhucan commented Oct 8, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

[screenshot attached]

The workflow was shut down with strategy: Terminate, but its status is stuck in the Running state.

I expect the task results to be marked completed and the workflow status not to be stuck in Running.

Version(s)

v3.5.8

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

I don't remember how to reproduce it.

Logs from the workflow controller

No errors recorded.

Logs from your workflow's wait container

No errors recorded.
@zhucan zhucan added the type/bug label Oct 8, 2024
@zhucan
Author

zhucan commented Oct 8, 2024

@jswxstw A small change like this can update the status of the wf:

if label == "false" && (old.IsPodDeleted() || old.FailedOrError()) {
	if recentlyDeleted(old) {
		woc.log.WithField("nodeID", nodeID).Debug("Wait for marking task result as completed because pod is recently deleted.")
		// If the pod was deleted, it is possible that the controller never gets another informer message about it.
		// In that case, the workflow would only be requeued after the resync period (20m), meaning the
		// workflow would not update for 20m. Requeuing here prevents that from happening.
		woc.requeue()
		continue
	} else {
		woc.log.WithField("nodeID", nodeID).Info("Marking task result as completed because pod has been deleted for a while.")
		woc.wf.Status.MarkTaskResultComplete(nodeID)
	}
}

@Joibel
Member

Joibel commented Oct 8, 2024

This should be fixed in 3.5.11.

@zhucan
Author

zhucan commented Oct 8, 2024

@Joibel Could you paste the PR links?

@jswxstw
Member

jswxstw commented Oct 8, 2024

@Joibel Could you paste the PR links?

Related PR: #13491. Have you tested it with v3.5.11? @zhucan

@zhucan
Author

zhucan commented Oct 8, 2024

@jswxstw I had cherry-picked the PR onto v3.5.8; it does not fix the problem.

@zhucan
Author

zhucan commented Oct 8, 2024

The status of the pod is not pod.Status.Reason == "Evicted". @jswxstw

@jswxstw
Member

jswxstw commented Oct 8, 2024

@jswxstw I had cherry-picked the PR onto v3.5.8; it does not fix the problem.

I'll check it out later.

@jswxstw
Member

jswxstw commented Oct 8, 2024

@zhucan Please check whether you have an RBAC problem (see #13537 (comment)); if so, the controller has to rely on podReconciliation, which is unreliable, to synchronize task result status.

The root cause may be as below:

  • The node is marked as Failed directly, before the pod is terminated, when the workflow is shutting down.
  • The pod cleanup order may be terminateContainers -> labelPodCompleted -> killContainers, which breaks podReconciliation: once the pod is labeled as completed, it can no longer be observed by the controller (see the sketch below).
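
To illustrate the second point, here is a rough, self-contained sketch, not the controller's actual code; the exact label selector is an assumption based on the completed label mentioned above. An informer that filters out pods labeled completed stops delivering updates for a pod as soon as that label is applied, so a reconciliation loop driven by it never sees the pod's final state.

package main

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (path is illustrative).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// The informer only sees pods that are NOT labeled completed (assumed selector).
	// Once a cleanup step labels the pod completed, any later update, e.g. the
	// containers actually terminating, is invisible here, so a reconciliation
	// loop driven by this informer cannot observe it.
	factory := informers.NewSharedInformerFactoryWithOptions(
		client,
		20*time.Minute, // resync period, mirroring the 20m mentioned earlier in this thread
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.LabelSelector = "workflows.argoproj.io/completed!=true"
		}),
	)
	podInformer := factory.Core().V1().Pods().Informer()
	_ = podInformer // in real code: register event handlers and run factory.Start(stopCh)
}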

@agilgur5 agilgur5 added problem/more information needed Not enough information has been provided to diagnose this issue. solution/outdated This is not up-to-date with the current version area/controller Controller issues, panics labels Oct 8, 2024
@agilgur5 agilgur5 changed the title workflow shutdown with strategy: Terminate, but the status of the workflow stuck running state workflow shutdown with strategy: Terminate, but stuck in Running Oct 8, 2024
@agilgur5

agilgur5 commented Oct 8, 2024

Version(s)

v3.5.8

  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.

You checked this off, but did not test with :latest. This should be fixed in 3.5.11, as Alan said. This is not optional.

I don't remember how to reproduce it.

You also did not provide a reproduction or logs, which makes this difficult, if not impossible, to investigate.

Please fill out the issue template accurately and in full; it is there for a reason. It is not optional.

@jswxstw I had cherry-picked the PR onto v3.5.8; it does not fix the problem.

I've told you this before: that means you're running a fork, and we don't support forks (that's not possible, by definition). You can file an issue in that fork.

@agilgur5 agilgur5 closed this as completed Oct 8, 2024
@zhucan
Author

zhucan commented Oct 8, 2024

@zhucan Please check whether you have an RBAC problem (see #13537 (comment)); if so, the controller has to rely on podReconciliation, which is unreliable, to synchronize task result status.

I checked the controller logs; there are no RBAC warnings. @jswxstw

@zhucan
Author

zhucan commented Oct 8, 2024

I've told you this before: that means you're running a fork, and we don't support forks (that's not possible, by definition). You can file an issue in that fork.

We can't always upgrade to the latest version when there are bugs in the version we are running; we need to know which PR fixes it, not just upgrade whenever there is a bug, because we don't know whether the new version has other bugs. If you can't help with that, there is no need to answer the question.

@zhucan
Author

zhucan commented Oct 8, 2024

  • The node is marked as Failed directly, before the pod is terminated, when the workflow is shutting down.

func (n NodeStatus) IsPodDeleted() bool
The node is marked as Failed directly, but the error message is not "pod deleted"; it is "workflow shutdown with strategy: Terminate". The status is the same, but the error message is not. @jswxstw
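
For illustration, a minimal self-contained sketch of that mismatch (the real NodeStatus and IsPodDeleted live in the Argo codebase and may differ; the message-based check here is an assumption):

package main

import "fmt"

// Stand-in type for illustration only; not the real Argo Workflows NodeStatus.
type NodeStatus struct {
	Phase   string
	Message string
}

// Hypothetical check that keys off the node message, as assumed above.
func (n NodeStatus) IsPodDeleted() bool {
	return n.Message == "pod deleted"
}

func main() {
	// A node failed by shutdown carries the terminate message, not "pod deleted",
	// so a message-based check like this never treats it as a deleted pod.
	n := NodeStatus{Phase: "Failed", Message: "workflow shutdown with strategy: Terminate"}
	fmt.Println(n.IsPodDeleted()) // false
}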

@agilgur5

agilgur5 commented Oct 8, 2024

We can't always upgrade to the latest version

The issue template asks that you, at minimum, check whether :latest resolves your bug. If it does, your bug has already been fixed and you can search through the changelog to see what fixed it.
Filing an issue despite that would be duplicative, as it very likely is here, and invalid, for not following the issue template.

when there are bugs in the version we are running; we need to know which PR fixes it, not just upgrade whenever there is a bug, because we don't know whether the new version has other bugs

You could say this of literally any software. Virtually all software has bugs. If you were to follow this and fork every dependency of yours, you wouldn't be doing anything other than dependency management (that is a big part of software development these days, but usually not the only thing). You're using Argo as a dependency, so if you update other dependencies to fix bugs, you would do the same with Argo.

If you can't help with that, there is no need to answer the question.

That's not how OSS works -- you filed a bug report for a fork to the origin. Your bug report is therefore invalid as this is not that fork.
If you want to contribute to OSS or receive free community support, you should follow the rules and norms of OSS and that community, including following issue templates. You did not follow those.
Other communities and other repos may very well auto-close your issue with no response whatsoever for not following templates, and could even block you for repeatedly doing so.
Please do note that you are receiving free community support here, despite the fact that you repeatedly did not follow the rules.

If you want support for a fork, you can pay a vendor for that. You should not expect community support from the origin for your own fork; that is neither possible (by definition) nor sustainable.

@agilgur5 agilgur5 changed the title workflow shutdown with strategy: Terminate, but stuck in Running 3.5.8: workflow shutdown with strategy: Terminate, but stuck in Running Oct 8, 2024
@agilgur5 agilgur5 changed the title 3.5.8: workflow shutdown with strategy: Terminate, but stuck in Running v3.5.8: workflow shutdown with strategy: Terminate, but stuck in Running Oct 8, 2024
@agilgur5 agilgur5 added this to the v3.5.x patches milestone Oct 8, 2024
@jswxstw
Member

jswxstw commented Oct 9, 2024

The node is marked as Failed directly, but the error message is not "pod deleted"; it is "workflow shutdown with strategy: Terminate". The status is the same, but the error message is not. @jswxstw

@zhucan This is a fix for #12993 and #13533, where the wait container exited abnormally due to pod deletion. There are two related PRs: #13454 and #13537.
You can see #13537 (comment) for a summary.

Workflow shutdown will not cause the wait container to exit abnormally, so this issue should not exist in v3.5.8. I can't help more, since you provided very little information.
