fix: Mark task result as completed if pod has been deleted for a while. Fixes #13533 #13537
Conversation
Fixes argoproj#13533 Signed-off-by: oninowang <[email protected]>
/retest
```go
func podAbsentTimeout(node *wfv1.NodeStatus) bool {
	return time.Since(node.StartedAt.Time) <= envutil.LookupEnvDurationOr("POD_ABSENT_TIMEOUT", 2*time.Minute)
}

func recentlyDeleted(node *wfv1.NodeStatus) bool {
	return time.Since(node.FinishedAt.Time) <= envutil.LookupEnvDurationOr("RECENTLY_DELETED_POD_DURATION", 10*time.Second)
}
```
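For readers outside the codebase, here is a minimal standalone sketch of the pattern above: a grace period read from an environment variable (with a default) compared against a node timestamp. The `nodeStatus` type and `lookupEnvDurationOr` helper are simplified stand-ins for illustration, not the real `wfv1`/`envutil` code.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// nodeStatus is a stand-in for wfv1.NodeStatus with only the fields used here.
type nodeStatus struct {
	StartedAt  time.Time
	FinishedAt time.Time
}

// lookupEnvDurationOr mimics the idea of envutil.LookupEnvDurationOr: parse a
// duration from the environment, falling back to a default on absence or error.
func lookupEnvDurationOr(key string, def time.Duration) time.Duration {
	if v, ok := os.LookupEnv(key); ok {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return def
}

// recentlyDeleted reports whether the node finished within the grace period,
// i.e. the pod may have been deleted too recently for the wait container to
// have reported its task result.
func recentlyDeleted(node *nodeStatus) bool {
	return time.Since(node.FinishedAt) <= lookupEnvDurationOr("RECENTLY_DELETED_POD_DURATION", 10*time.Second)
}

func main() {
	node := &nodeStatus{FinishedAt: time.Now().Add(-30 * time.Second)}
	fmt.Println("recently deleted:", recentlyDeleted(node)) // false: past the 10s grace period
}
```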
If I'm not mistaken, the reason for using `startedAt` was because the `finishedAt` time may never have been recorded.
When I reviewed this I'd read this as being a finished rather than a start time, so the intention here is correct.
I think the current `IsPodDeleted()` guards this against not having a FinishedAt, but that is problematic. @isubasinghe and I did discuss this briefly before he disappeared for the weekend, and we'd like to have a proper look, as this almost certainly breaks the part of #13454 which was to ensure the 3.4->3.5 upgrade worked as hoped for in-flight workflows.
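To make the concern concrete, here is a purely illustrative, self-contained sketch (stand-in types, not the real `IsPodDeleted()` logic): if `FinishedAt` was never recorded, its zero value makes any "time since finished" check look long expired, so a guard might fall back to `StartedAt`.

```go
package main

import (
	"fmt"
	"time"
)

type nodeStatus struct {
	StartedAt  time.Time
	FinishedAt time.Time // may be the zero value if never recorded
}

// deletionGraceExpired picks FinishedAt when available, otherwise StartedAt,
// before deciding whether the grace period for a deleted pod has passed.
func deletionGraceExpired(n *nodeStatus, grace time.Duration) bool {
	ref := n.FinishedAt
	if ref.IsZero() {
		// FinishedAt was never recorded; fall back to StartedAt.
		ref = n.StartedAt
	}
	return time.Since(ref) > grace
}

func main() {
	n := &nodeStatus{StartedAt: time.Now().Add(-5 * time.Second)} // FinishedAt never set
	fmt.Println(deletionGraceExpired(n, 10*time.Second))          // false: falls back to StartedAt
}
```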
> this almost certainly breaks the part of #13454 which was to ensure the 3.4->3.5 upgrade worked as hoped for in-flight workflows.

@Joibel Do you mean this problem #12103 (comment)?
That is correct. @jswxstw do you think you would be able to make time for a chat about this issue?
Actually this doesn't break the 3.4 -> 3.5 upgrade path; the `else if` check is present, meaning we never falsely update the TaskCompletionStatus.
> That is correct. @jswxstw do you think you would be able to make time for a chat about this issue?

Sure, I have time now.

> Actually this doesn't break the 3.4 -> 3.5 upgrade path; the `else if` check is present, meaning we never falsely update the TaskCompletionStatus.

Yes, I kept this fix.
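As a rough, self-contained illustration of the guard being discussed (all identifiers below are stand-ins, not the controller's real code): the task result is only force-completed when the pod is gone and the grace period has passed; otherwise the completion status is left untouched so the wait container's report still has a chance to arrive.

```go
package main

import (
	"fmt"
	"time"
)

type node struct {
	ID         string
	FinishedAt time.Time
}

type workflow struct {
	taskResultsCompleted map[string]bool
}

func (w *workflow) reconcileDeletedPod(n *node, podExists bool, grace time.Duration) {
	if podExists {
		return // pod still there: the wait container can still report
	}
	if time.Since(n.FinishedAt) <= grace {
		return // recently deleted: give the task result a chance to arrive
	}
	// Pod has been gone past the grace period: stop waiting on its result.
	w.taskResultsCompleted[n.ID] = true
}

func main() {
	w := &workflow{taskResultsCompleted: map[string]bool{"step-1": false}}
	n := &node{ID: "step-1", FinishedAt: time.Now().Add(-time.Minute)}
	w.reconcileDeletedPod(n, false, 10*time.Second)
	fmt.Println(w.taskResultsCompleted["step-1"]) // true
}
```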
> Sure, I have time now.

It's alright, I have enough confidence now that this is a better approach than #13454.
Related: #13798 (comment)
This code should fix #13533. However, I am unsure if this will fix all related issues.
Confirmed that in-flight workflows on the 3.4 -> 3.5 upgrade path will work as expected.
Unfortunately, this path is impacted by another bug and will result in infinite recursion. There is a fix for this as well.
/retest
@isubasinghe As far as I know, almost all the related issues are that the wait container does not report its task result.
Makes sense. I've gone through the code in quite a lot of depth for effectively all of today and came to a similar conclusion; your agreement clears my doubts, thanks.
Fixes argoproj#13533 (argoproj#13537) Signed-off-by: oninowang <[email protected]>
Fixes #13533 (#13537) Signed-off-by: oninowang <[email protected]>
I've upgraded to 3.5.11, using chart 0.42.3, to use this bug fix, but I still see that workflows are stuck in a running state after a `kubectl delete` of the pod. Our k8s cluster is running v1.29.7. I waited 15 minutes for the state to change, fwiw.
Some testing: if I only wait for the UI to show the Pod phase as pending and `kubectl delete`, it spins up a new pod just fine. If I wait until the phase is running and then `kubectl delete`, that is when it gets stuck. Here are the Argo workflow logs for when it gets stuck.
@ericblackburn I think you should fix the RBAC problem, which is causing this PR not to work.
@jswxstw Unfortunately I can confirm that the issue still occurs. This might be due to a race condition; I am currently unsure of how this happens.
Do you mean a pod will be deleted with a message other than `pod deleted`?
@jswxstw Sorry, yeah, I meant deleted.
I think we should just see if the pod for that node disappeared in the
Yes, we can use
I would like it to wait some time between observing a gone pod without a
argo-workflows/workflow/controller/operator.go Lines 1459 to 1461 in cf6223d
Under normal circumstances, the node can only be set to
Okay, I found out what happens. If the node has been marked failed by some other failure and we are still waiting on task results, the message will never be updated to reflect the deletion. I suggest that `markNodePhase` should be responsible for marking the task result completed in this case.
Opinions, @jswxstw?
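A tiny standalone sketch of that suggestion (assumed names and a simplified state shape, not the operator's actual `markNodePhase`): whatever marks a node Failed also records its task result as completed, so the workflow does not wait forever on a report that the deleted pod's wait container can no longer send.

```go
package main

import "fmt"

type nodePhase string

const (
	nodeRunning nodePhase = "Running"
	nodeFailed  nodePhase = "Failed"
)

type node struct {
	ID      string
	Phase   nodePhase
	Message string
}

// workflowState is a stand-in for the pieces of workflow status involved.
type workflowState struct {
	nodes                map[string]*node
	taskResultsCompleted map[string]bool
}

// markNodeFailed marks the node failed and, per the suggestion above, also
// marks its task result completed so completion accounting can converge.
func (w *workflowState) markNodeFailed(id, message string) {
	n := w.nodes[id]
	n.Phase = nodeFailed
	n.Message = message
	w.taskResultsCompleted[id] = true
}

func main() {
	w := &workflowState{
		nodes:                map[string]*node{"step-1": {ID: "step-1", Phase: nodeRunning}},
		taskResultsCompleted: map[string]bool{"step-1": false},
	}
	w.markNodeFailed("step-1", "pod deleted")
	fmt.Println(w.nodes["step-1"].Phase, w.taskResultsCompleted["step-1"]) // Failed true
}
```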
@isubasinghe Do you mean the node is marked failed in `podReconciliation`? I found that it will cause some problems if the controller marks the node as failed directly regardless of the pod's status, see #13726 (comment). Are you encountering similar issues?
@jswxstw I mean if it is marked failed elsewhere, not in `podReconciliation`. I think
@isubasinghe Could you specify exactly where? In my opinion, the controller should not mark the node as failed directly regardless of the pod's status.
@jswxstw I wasn't able to find where exactly. I think what we should do is: if we mark nodes failed, we should mark the task result complete. Also, if we get a false label in
@isubasinghe I think there are risks in doing this, because at this point we cannot confirm whether the informer has watched task result updates; this process may be delayed.
@jswxstw but we don't care about a task result after we fail a node, correct? So it doesn't matter if we receive a task result or not.
@isubasinghe We may need the exact exit code of the node from the task result.
@jswxstw I think it should be okay after looking at the code. We seem to only add to scope if the node succeeded: argo-workflows/workflow/controller/operator.go Line 1144 in bb5130e
So you are right that the second conditional will now return true, but the first condition still returns false. Therefore this behaviour should remain the same.
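A simplified, assumed-shape illustration of the point above (not the operator's actual code at the referenced line): outputs only reach the scope when the node succeeded, so completing the task result of a failed node does not feed anything extra into scope resolution.

```go
package main

import "fmt"

type node struct {
	Phase   string
	Outputs map[string]string
}

// addToScope mirrors the idea of the succeeded-only check being discussed:
// anything that did not succeed contributes nothing to the scope.
func addToScope(scope map[string]string, prefix string, n *node) {
	if n.Phase != "Succeeded" || n.Outputs == nil {
		return
	}
	for k, v := range n.Outputs {
		scope[prefix+"."+k] = v
	}
}

func main() {
	scope := map[string]string{}
	failed := &node{Phase: "Failed", Outputs: map[string]string{"result": "x"}}
	addToScope(scope, "steps.step-1.outputs.parameters", failed)
	fmt.Println(len(scope)) // 0: a failed node contributes nothing to scope
}
```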
Steps/DAGs don't seem to care about task results. I wonder if this is a bug in that part of the codebase? It should probably wait until a parent node's task result has arrived before creating the child.
There is an inconsistency in the judgment logic in these two places: argo-workflows/workflow/controller/operator.go Line 2335 in bb5130e
argo-workflows/workflow/controller/operator.go Line 1144 in bb5130e
Only nodes of pod type have task results; other nodes depend on their child nodes.
Sorry, I meant pods created by the templates {Step, DAG}. They do not wait until a task result is complete before creating a child. I think this is a bug? Because a task could have a
Agreed, but the first is for the workflow phase; in this case you are right, it will enter this branch, but that seems harmless? The second one: there will be no change, since line 1144 checks if the node succeeded.
In my PR I have proposed an alternative fix: #13798,
Agreed, we may also check if the task result is Incomplete when the node is failed.
This PR may exacerbate the above bug, but it is likely to have no impact, as nodes are rarely marked as failed directly outside of `podReconciliation`.
Yeah, this is what I think as well. I do think we need to rethink the task result implementation entirely.
@jswxstw I made the change here (#13798) to check if the pod exists instead. It turns out we do care about failed nodes; this was why the test was failing. I'm not sure if you'd agree with me, but I now think that this task result should maybe even be communicated via gRPC or something like that to the controller directly.
Security-wise, the Executor should never need to, nor be able to, directly communicate with the Controller. The CRs like
I don't really see the risk here personally, though; could you elaborate on how this creates a security risk? We only have two choices here that I would be confident fix this class of bugs: a) communicate directly to the controller.
The Controller would now be accessible via the network from every namespace that Workflows run in. Other than breaking network policies all over, this creates a whole new attack surface, as well as gives the Controller an API as an additional attack surface (and dependencies). Access to the Controller across namespaces also defeats a decent bit of the purpose of managed namespaces, etc. That's a massive can of worms and a fundamental shift in the architecture that I don't think should be changed.
I agree with @agilgur5. Argo, as a workflow engine for Kubernetes, is implemented based on asynchronous coordination. Transforming it into a synchronous mode is not in line with the Kubernetes style.
Yeah that is fair enough.
Yeah, I agree on this; I just don't see how to fix these issues with any sort of confidence. I was mostly making a statement out of frustration.
I feel that the reasons for these issues are twofold:
After so many fixes, we have actually covered the vast majority of scenarios. If there are still any edge cases that haven't been considered, the probability of them occurring will be very low, and the risk is manageable.
Too often I hear that some limitation is because of k8s. Really, the executor pods seem like the only thing that benefits from k8s, so that each running task gets isolated compute; apart from that, I don't see why k8s can be blamed for anything else. State tracking can be done with a database, just like Airflow does it. The executor can have a wrapper that writes a 'started' row to the DB when it begins (including a pod name column); then when it's done, based on the exit code, it updates the DB with DONE or FAIL. Periodically check for ungraceful termination ('started' rows where no pod exists) and update the DB to 'EVICTED'.
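For concreteness, a sketch of that proposal only (this is not anything Argo does today); the table layout, column names, and the sqlite driver choice are assumptions for illustration.

```go
package main

import (
	"database/sql"
	"log"
	"os/exec"

	_ "github.com/mattn/go-sqlite3" // driver choice is arbitrary for this sketch
)

func main() {
	db, err := sql.Open("sqlite3", "tasks.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS task_runs (
		task_id TEXT PRIMARY KEY, pod_name TEXT, state TEXT)`); err != nil {
		log.Fatal(err)
	}

	taskID, podName := "step-1", "my-wf-step-1-12345"

	// 1. The wrapper records a STARTED row before running the task.
	if _, err := db.Exec(`INSERT OR REPLACE INTO task_runs VALUES (?, ?, 'STARTED')`,
		taskID, podName); err != nil {
		log.Fatal(err)
	}

	// 2. Run the task and record DONE or FAIL based on the exit code.
	state := "DONE"
	if err := exec.Command("sh", "-c", "exit 0").Run(); err != nil {
		state = "FAIL"
	}
	if _, err := db.Exec(`UPDATE task_runs SET state = ? WHERE task_id = ?`,
		state, taskID); err != nil {
		log.Fatal(err)
	}

	// 3. Elsewhere, a periodic sweep would flip STARTED rows whose pod no
	//    longer exists to EVICTED (the pod existence lookup is omitted here).
	_, _ = db.Exec(`UPDATE task_runs SET state = 'EVICTED'
		WHERE state = 'STARTED' AND pod_name = ?`, "some-missing-pod")
}
```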
There's very little difference between a DB and using k8s resources/etcd. That also doesn't solve Pod tracking, which already uses k8s.
Yeah, that sounds about right to me, with non-graceful termination as part of 2. 1 can't be done atomically given the different Informers involved, so it needs a different form of async coordination with some assumptions baked in.
Fixes #13533
Motivation
See #13533 for details.
Modifications
Mark task result as completed if pod has been deleted for a while.
Verification
Corner case, hard to reproduce.