v3.5.10: Workflow stuck Running after pod runs 2min and then is deleted not gracefully #13533
Fixes argoproj#13533 Signed-off-by: oninowang <[email protected]>
This was a careless logic bug on my end, apologies for that. As for problem 1, which would still be present even if this bug didn't exist: that is a fair point.
Fixes #13533 (#13537) Signed-off-by: oninowang <[email protected]>
I believe this bug is still present.
@ericblackburn these edge cases are hard to reproduce; could you please give us a workflow along with instructions to reproduce it?
The reason I was still seeing this behavior is documented in #13537 (comment). The problem and solution have been identified.
#13533 (#13798) Signed-off-by: isubasinghe <[email protected]>
@jswxstw which PR fixes this problem, and how can it be reproduced?
@zhucan Problem 1 was caused by an incorrect comparison operator in the previous PR (see #13533 (comment)), which was fixed by #13537; #13798 further improved the handling of more special cases.
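For intuition, here is a minimal sketch of the operator direction described in that comment; the names podAbsentTooLong, goneSince, and podAbsentTimeout are illustrative assumptions, not the controller's actual identifiers:

```go
package main

import (
	"fmt"
	"time"
)

// podAbsentTooLong sketches the corrected comparison: the absent-pod branch
// should fire only once the pod has been gone for longer than the timeout.
// With the buggy direction, time.Since(goneSince) <= podAbsentTimeout, the
// branch is true immediately after deletion, so the timeout gates nothing.
func podAbsentTooLong(goneSince time.Time, podAbsentTimeout time.Duration) bool {
	return time.Since(goneSince) > podAbsentTimeout
}

func main() {
	fmt.Println(podAbsentTooLong(time.Now(), 2*time.Minute))                     // false: pod just deleted
	fmt.Println(podAbsentTooLong(time.Now().Add(-3*time.Minute), 2*time.Minute)) // true: timeout elapsed
}
```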
@zhucan This is what cannot be determined. However,
Pre-requisites

- I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.

What happened? What did you expect to happen?
#13454 marks the node as failed after a timeout, and marks the WorkflowTaskResult as completed in taskResultReconciliation only when the pod is absent and the node has not been completed.

If we want to check whether the pod is absent after a timeout, why use <= here?

argo-workflows/workflow/controller/taskresult.go, line 58 in fed83ca
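As a minimal sketch of the questioned check (assumed names and structure, not the actual taskresult.go code): with <=, the condition holds from the moment the pod disappears until the timeout elapses, which is the opposite of waiting for the timeout.

```go
package main

import (
	"fmt"
	"time"
)

// Assumed default for POD_ABSENT_TIMEOUT, for illustration only.
const podAbsentTimeout = 2 * time.Minute

// reconcileSketch mimics the questioned branch: with <=, it triggers while
// the pod has been absent for LESS than the timeout, i.e. right away.
func reconcileSketch(podExists bool, finishedAt time.Time) string {
	if !podExists && time.Since(finishedAt) <= podAbsentTimeout {
		return "task result marked completed"
	}
	return "waiting for the wait container's report"
}

func main() {
	// Pod deleted just now: the <= branch already fires, well before 2min.
	fmt.Println(reconcileSketch(false, time.Now()))
}
```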
I am very puzzled by this PR, see #13373 (comment).
I think this PR has two problems:

1. If the pod is deleted (within POD_ABSENT_TIMEOUT), its task result will be marked as completed immediately. However, we cannot confirm that the pod did not exit gracefully, and its task result may not have been observed by the controller yet.
2. If the pod is deleted (after POD_ABSENT_TIMEOUT), the node will be marked as Error with the message "pod deleted", and its task result will always be incomplete if the pod did not exit gracefully (see the sketch after this list).
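To see why problem 2 leaves the workflow stuck in Running, here is a minimal sketch under an assumed (not actual) shape of the completion gate: if finishing requires every task result to be complete, one permanently incomplete result blocks the workflow forever.

```go
package main

import "fmt"

// workflowCanFinish sketches the completion gate: if one task result can
// never become complete (its pod was deleted before the wait container
// reported), the workflow is stuck in Running indefinitely.
func workflowCanFinish(taskResultCompleted map[string]bool) bool {
	for _, done := range taskResultCompleted {
		if !done {
			return false
		}
	}
	return true
}

func main() {
	results := map[string]bool{
		"step-a": true,
		"step-b": false, // pod deleted ungracefully; its result is never reported
	}
	fmt.Println(workflowCanFinish(results)) // false: workflow remains Running
}
```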
I reproduced problem 2 as below:

- Modify the FinalizeOutput logic to facilitate the simulation of an abnormal exit midway.

workflow spec:
Version(s)
latest
Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
Logs from your workflow's wait container