
fix: Mark task result as completed if pod has been deleted for a while. Fixes #13533 #13537

Merged 1 commit on Sep 2, 2024

Conversation

@jswxstw (Member) commented Aug 30, 2024

Fixes #13533

Motivation

See #13533 for details.

Modifications

Mark task result as completed if pod has been deleted for a while.

Verification

Corner case, hard to reproduce.

@jswxstw (Member Author) commented Aug 30, 2024

/retest

func podAbsentTimeout(node *wfv1.NodeStatus) bool {
	return time.Since(node.StartedAt.Time) <= envutil.LookupEnvDurationOr("POD_ABSENT_TIMEOUT", 2*time.Minute)
}

func recentlyDeleted(node *wfv1.NodeStatus) bool {
	return time.Since(node.FinishedAt.Time) <= envutil.LookupEnvDurationOr("RECENTLY_DELETED_POD_DURATION", 10*time.Second)
}
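For readers following along, here is a minimal, runnable distillation of the grace-period check above, using only the Go standard library. lookupDurationOr is a stand-in for envutil.LookupEnvDurationOr, and finishedAt stands in for node.FinishedAt.Time; this is a sketch of the idea, not the controller code itself.

package main

import (
	"fmt"
	"os"
	"time"
)

// lookupDurationOr reads a duration from the environment, falling back to a default.
func lookupDurationOr(key string, def time.Duration) time.Duration {
	if v := os.Getenv(key); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return def
}

// recentlyDeleted reports whether the node finished within the configurable grace
// period, i.e. whether a missing task result might still arrive.
func recentlyDeleted(finishedAt time.Time) bool {
	return time.Since(finishedAt) <= lookupDurationOr("RECENTLY_DELETED_POD_DURATION", 10*time.Second)
}

func main() {
	finishedAt := time.Now().Add(-30 * time.Second)
	// After the default 10s window has passed, the node is no longer "recently deleted"
	// and its task result can be forced to completed.
	fmt.Println("within grace period:", recentlyDeleted(finishedAt)) // false
}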


If I'm not mistaken, the reason for using startedAt was that the finishedAt time may never have been recorded.

@Joibel (Member)

When I reviewed this I'd read this as being a finished rather than a start time - so the intention here is correct.

I think the current IsPodDeleted() guards this against not having a FinishedAt, but that is problematic. @isubasinghe and I did discuss this briefly before he disappeared for the weekend, and we'd like to have a proper look as this almost certainly breaks the part of #13454 which was to ensure 3.4->3.5 upgrade worked as hoped for in flight workflows.

@jswxstw (Member Author) Sep 2, 2024

this almost certainly breaks the part of #13454 which was to ensure 3.4->3.5 upgrade worked as hoped for in flight workflows.

@Joibel Do you mean this problem #12103 (comment)?

Member

That is correct. @jswxstw do you think you would be able to make time for a chat about this issue?

Member

Actually, this doesn't break the 3.4 -> 3.5 upgrade path: the else if check is present, meaning we never falsely update the TaskCompletionStatus.

@jswxstw (Member Author)

That is correct. @jswxstw do you think you would be able to make time for a chat about this issue?

Sure, I have time now.

Actually, this doesn't break the 3.4 -> 3.5 upgrade path: the else if check is present, meaning we never falsely update the TaskCompletionStatus.

Yes, I kept this fix.

@isubasinghe (Member) Sep 2, 2024

Sure, I have time now.

It's alright, I have enough confidence now that this is a better approach than #13454.


Related: #13798 (comment)

@agilgur5 agilgur5 requested review from Joibel and isubasinghe August 30, 2024 13:59
@agilgur5 agilgur5 added the area/controller Controller issues, panics label Aug 30, 2024
@agilgur5 agilgur5 added this to the v3.5.x patches milestone Aug 31, 2024
@isubasinghe (Member) left a comment

This code should fix #13533.

I am unsure if this will fix all related issues however.

Confirmed that in-flight workflows on the 3.4 -> 3.5 upgrade path will work as expected.

Unfortunately, this path is impacted by another bug and will result in infinite recursion. There is a fix for this as well.

@jswxstw (Member Author) commented Sep 2, 2024

/retest

@jswxstw (Member Author) commented Sep 2, 2024

I am unsure if this will fix all related issues however.

@isubasinghe As far as I know, almost all of the related issues are cases where the wait container does not report LabelKeyReportOutputsCompleted as expected:

  1. wait containers from old versions do not have the label LabelKeyReportOutputsCompleted; fixed by "fix: mark node failed if pod absent. Fixes #12993" (#13454)
  2. the wait container exits abnormally, including the following two situations:

@isubasinghe isubasinghe merged commit 8a67009 into argoproj:main Sep 2, 2024
28 checks passed
@isubasinghe (Member) commented Sep 2, 2024

Makes sense. I've gone through the code in quite a lot of depth for effectively all of today and came to a similar conclusion; your agreement clears my doubts, thanks.

@ericblackburn commented Sep 24, 2024

I've upgraded to 3.5.11, using chart 0.42.3, to use this bug fix but still see that workflows are stuck in a running state after a kubectl delete of the pod. Our k8s cluster is running v1.29.7. I waited 15 minutes for the state to change, fwiw.

@ericblackburn

Some testing: if I only wait for the UI to show the pod phase as Pending and then kubectl delete, it spins up a new pod just fine. If I wait until the phase is Running and then kubectl delete, that is when it gets stuck.

Here are the Argo workflow controller logs for when it gets stuck:

time="2024-09-24T16:25:40.253Z" level=info msg="Processing workflow" Phase= ResourceVersion=2553070831 namespace=testnamespace workflow=testworkflow-1727192700-n4v9v
time="2024-09-24T16:25:40.272Z" level=info msg="Task-result reconciliation" namespace=testnamespace numObjs=0 workflow=testworkflow-1727192700-n4v9v
time="2024-09-24T16:25:40.272Z" level=info msg="Updated phase  -> Running" namespace=testnamespace workflow=testworkflow-1727192700-n4v9v
time="2024-09-24T16:25:40.272Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=testnamespace workflow=testworkflow-1727192700-n4v9v
time="2024-09-24T16:25:40.273Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=testnamespace workflow=testworkflow-1727192700-n4v9v
time="2024-09-24T16:25:40.273Z" level=info msg="Pod node testworkflow-1727192700-n4v9v initialized Pending" namespace=testnamespace workflow=testworkflow-1727192700-n4v9v
time="2024-09-24T16:25:40.390Z" level=info msg="Created pod: testworkflow-1727192700-n4v9v (testworkflow-1727192700-n4v9v)" namespace=testnamespace workflow=testworkflow-1727192700-n4v9v
time="2024-09-24T16:25:40.390Z" level=info msg="TaskSet Reconciliation" namespace=testnamespace workflow=testworkflow-1727192700-n4v9v
time="2024-09-24T16:25:40.390Z" level=info msg=reconcileAgentPod namespace=testnamespace workflow=testworkflow-1727192700-n4v9v
time="2024-09-24T16:25:40.399Z" level=info msg="Workflow update successful" namespace=testnamespace phase=Running resourceVersion=2553070873 workflow=testworkflow-1727192700-n4v9v
time="2024-09-24T16:25:50.349Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=2553070873 namespace=testnamespace workflow=testworkflow-1727192700-n4v9v
time="2024-09-24T16:25:50.349Z" level=info msg="Task-result reconciliation" namespace=testnamespace numObjs=0 workflow=testworkflow-1727192700-n4v9v
time="2024-09-24T16:25:50.349Z" level=info msg="Workflow pod is missing" namespace=testnamespace nodeName=testworkflow-1727192700-n4v9v nodePhase=Pending recentlyStarted=false workflow=testworkflow-1727192700-n4v9v
time="2024-09-24T16:25:50.427Z" level=info msg="Created pod: testworkflow-1727192700-n4v9v (testworkflow-1727192700-n4v9v)" namespace=testnamespace workflow=testworkflow-1727192700-n4v9v
time="2024-09-24T16:25:50.427Z" level=info msg="TaskSet Reconciliation" namespace=testnamespace workflow=testworkflow-1727192700-n4v9v
time="2024-09-24T16:25:50.427Z" level=info msg=reconcileAgentPod namespace=testnamespace workflow=testworkflow-1727192700-n4v9v
time="2024-09-24T16:26:00.429Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=2553070873 namespace=testnamespace workflow=testworkflow-1727192700-n4v9v
time="2024-09-24T16:26:00.429Z" level=info msg="Task-result reconciliation" namespace=testnamespace numObjs=0 workflow=testworkflow-1727192700-n4v9v
time="2024-09-24T16:26:00.429Z" level=warning msg="workflow uses legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" namespace=testnamespace workflow=testworkflow-1727192700-n4v9v
time="2024-09-24T16:26:00.429Z" level=info msg="node changed" namespace=testnamespace new.message= new.phase=Running new.progress=0/1 nodeID=testworkflow-1727192700-n4v9v old.message= old.phase=Pending old.progress=0/1 workflow=testworkflow-1727192700-n4v9v
time="2024-09-24T16:26:00.429Z" level=info msg="TaskSet Reconciliation" namespace=testnamespace workflow=testworkflow-1727192700-n4v9v
time="2024-09-24T16:26:00.429Z" level=info msg=reconcileAgentPod namespace=testnamespace workflow=testworkflow-1727192700-n4v9v
time="2024-09-24T16:26:00.444Z" level=info msg="Workflow update successful" namespace=testnamespace phase=Running resourceVersion=2553075872 workflow=testworkflow-1727192700-n4v9v
time="2024-09-24T16:26:25.888Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=2553075872 namespace=testnamespace workflow=testworkflow-1727192700-n4v9v
time="2024-09-24T16:26:25.888Z" level=info msg="Task-result reconciliation" namespace=testnamespace numObjs=0 workflow=testworkflow-1727192700-n4v9v
time="2024-09-24T16:26:25.888Z" level=info msg="Workflow pod is missing" namespace=testnamespace nodeName=testworkflow-1727192700-n4v9v nodePhase=Running recentlyStarted=false workflow=testworkflow-1727192700-n4v9v
time="2024-09-24T16:26:25.888Z" level=error msg="Mark error node" error="pod deleted" namespace=testnamespace nodeName=testworkflow-1727192700-n4v9v workflow=testworkflow-1727192700-n4v9v
time="2024-09-24T16:26:25.888Z" level=info msg="node testworkflow-1727192700-n4v9v phase Running -> Error" namespace=testnamespace workflow=testworkflow-1727192700-n4v9v
time="2024-09-24T16:26:25.888Z" level=info msg="node testworkflow-1727192700-n4v9v message: pod deleted" namespace=testnamespace workflow=testworkflow-1727192700-n4v9v
time="2024-09-24T16:26:25.888Z" level=info msg="node testworkflow-1727192700-n4v9v finished: 2024-09-24 16:26:25.888587111 +0000 UTC" namespace=testnamespace workflow=testworkflow-1727192700-n4v9v
time="2024-09-24T16:26:25.888Z" level=info msg="TaskSet Reconciliation" namespace=testnamespace workflow=testworkflow-1727192700-n4v9v
time="2024-09-24T16:26:25.888Z" level=info msg=reconcileAgentPod namespace=testnamespace workflow=testworkflow-1727192700-n4v9v
time="2024-09-24T16:26:25.908Z" level=info msg="Workflow update successful" namespace=testnamespace phase=Running resourceVersion=2553082225 workflow=testworkflow-1727192700-n4v9v

@jswxstw (Member Author) commented Sep 27, 2024

time="2024-09-24T16:26:00.429Z" level=warning msg="workflow uses legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" namespace=testnamespace workflow=testworkflow-1727192700-n4v9v

@ericblackburn I think you should fix the RBAC problem, which is preventing this PR's fix from working.

@ericblackburn commented Sep 27, 2024

Yep, that was it. Good call out.

@isubasinghe (Member) commented Oct 15, 2024

@jswxstw unfortunately I can confirm that the node.Message can be something other than "pod deleted" when a pod is killed, which means that your fix won't work. Do you have any ideas about how we can tackle this without checking the message?

This might be due to a race condition, I am currently unsure of how this happens.

@jswxstw (Member Author) commented Oct 15, 2024

Do you mean a pod will be deleted with a message other than "pod deleted"?
If a pod is just killed but not deleted yet, #13491 will check the actual status of the wait container.

@isubasinghe (Member)

@jswxstw sorry yeah I meant deleted

@isubasinghe (Member)

I think we should just see if the pod for that node has disappeared in the IsPodDeleted function. Opinions, @jswxstw?

@jswxstw (Member Author) commented Oct 15, 2024

I think we should just see if the pod for that node has disappeared in the IsPodDeleted function. Opinions, @jswxstw?

Yes, we can use woc.podExists instead to check if its pod has disappeared.
However, if a pod is deleted with a message other than "pod deleted", it means the controller has observed the pod status after podReconciliation and it should already know the actual status of the wait container. Are there other scenarios I haven't considered?
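A self-contained sketch of the idea discussed above (hypothetical types and helpers; the real IsPodDeleted and woc.podExists live inside the controller and have different signatures): decide that a pod-type node's pod was deleted from whether the pod still exists, rather than from node.Message.

package main

import "fmt"

// node is a stand-in for the controller's wfv1.NodeStatus.
type node struct {
	ID      string
	Type    string // "Pod", "Steps", "DAG", ...
	Message string
}

// podExists stands in for the controller's informer-backed pod lookup.
func podExists(livePods map[string]bool, nodeID string) bool {
	return livePods[nodeID]
}

// isPodDeleted reports a deleted pod based on existence, not on node.Message.
func isPodDeleted(n node, livePods map[string]bool) bool {
	return n.Type == "Pod" && !podExists(livePods, n.ID)
}

func main() {
	livePods := map[string]bool{"node-1": true}
	fmt.Println(isPodDeleted(node{ID: "node-2", Type: "Pod"}, livePods)) // true: pod is gone
	fmt.Println(isPodDeleted(node{ID: "node-1", Type: "Pod"}, livePods)) // false: pod still present
}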

@Joibel (Member) commented Oct 15, 2024

I would like it to wait some time between observing a gone pod without a completed workflow task result and marking the task result as completed. It could be that the completed WorkflowTaskResult just hasn't been observed yet.

@jswxstw (Member Author) commented Oct 15, 2024

This might be due to a race condition, I am currently unsure of how this happens.

case c.State.Running != nil && new.Phase.Completed():
	woc.log.WithField("new.phase", new.Phase).Info("leaving phase un-changed: wait container is not yet terminated ")
	new.Phase = old.Phase

Under normal circumstances, the node can only be set to Completed when the wait container has truly exited.
Therefore, in theory, we should always be able to obtain the actual state of the wait container here, unless the pod is deleted before the controller has a chance to observe it. In such cases, it will be marked as 'pod deleted'.

@isubasinghe (Member)

Okay I found out what happens. If the node has been marked failed by some other failure and we are still waiting on task results, the message will never be updated to reflect pod deleted since podReconciliation will skip this node.

I suggest that markNodePhase should be responsible for marking the task result completed in this case.
I see no reason why we would want to wait around for a task result when the node has already been marked Failed/Errored.

@isubasinghe (Member)

Opinions, @jswxstw?

@jswxstw (Member Author) commented Oct 21, 2024

Okay I found out what happens. If the node has been marked failed by some other failure and we are still waiting on task results, the message will never be updated to reflect pod deleted since podReconciliation will skip this node.

@isubasinghe Do you mean the node is marked failed in podReconciliation? If so, it will not cause any problems, because the controller will check the actual state of the wait container.

I found that it will cause some problems if the controller marks the node as failed directly regardless of the pod's status, see #13726 (comment). Are you encountering similar issues?

@isubasinghe (Member) commented Oct 21, 2024

@jswxstw I mean if it is marked failed elsewhere, not in podReconciliation.

I think markNodePhase should include an if condition to check if the phase is error/failure and then call markTaskResultsComplete(node.ID).
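A small, self-contained sketch of that suggestion (invented types; the actual markNodePhase in the controller is more involved): when a node is marked Failed or Errored, its task result is also marked completed so nothing keeps waiting on it.

package main

import "fmt"

type NodePhase string

const (
	NodeFailed NodePhase = "Failed"
	NodeError  NodePhase = "Error"
)

// controller is a stand-in for the workflow operation context.
type controller struct {
	taskResultsCompleted map[string]bool
}

func (c *controller) markTaskResultsComplete(nodeID string) {
	c.taskResultsCompleted[nodeID] = true
}

// markNodePhase updates a node's phase; on Failed/Error it also completes the task result.
func (c *controller) markNodePhase(nodeID string, phase NodePhase) {
	// ... persist the phase on the node status here ...
	if phase == NodeFailed || phase == NodeError {
		c.markTaskResultsComplete(nodeID)
	}
}

func main() {
	c := &controller{taskResultsCompleted: map[string]bool{}}
	c.markNodePhase("node-1", NodeFailed)
	fmt.Println(c.taskResultsCompleted["node-1"]) // true
}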

@jswxstw (Member Author) commented Oct 21, 2024

@jswxstw I mean if it is marked failed elsewhere, not in podReconciliation.

@isubasinghe Could you specify exactly where? In my opinion, the controller should not mark the node as failed directly, regardless of the pod's status.

@isubasinghe (Member)

@jswxstw I wasn't able to find where exactly. I think what we should do is: if we mark nodes failed, we should mark the task result complete.

Also, if we get a false label in taskResultReconciliation, we should first make sure the node hasn't failed or errored before labelling the task result incomplete.

@jswxstw (Member Author) commented Oct 23, 2024

@isubasinghe I think there are risks in doing this, because at this point we cannot confirm whether the informer has observed the task result updates; this process may be delayed.

@isubasinghe (Member)

@jswxstw but we don't care about a task result after we fail a node, correct? So it doesn't matter if we receive a task result or not.

@jswxstw (Member Author) commented Oct 23, 2024

@isubasinghe We may need the exact exit code of the node from the task result.

@isubasinghe (Member)

@jswxstw I think it should be okay after looking at the code.

We seem to only add to scope if the node succeeded:

if newState.Succeeded() && woc.wf.Status.IsTaskResultIncomplete(node.ID) {

So you are right that the second conditional will now return true, but the first condition still returns false. Therefore, this behaviour should remain the same.

@isubasinghe (Member) commented Oct 23, 2024

Steps/DAGs don't seem to care about task results. I wonder if this is a bug in that part of the codebase?

It should probably wait until a parent node's task result has arrived before creating the child.

@jswxstw (Member Author) commented Oct 23, 2024

So you are right that the second conditional will now return true, but the first condition still returns false. Therefore, this behaviour should remain the same.

There is an inconsistency in the judgment logic in these two places:

if phase.Completed() && (woc.checkTaskResultsInProgress() || woc.hasDaemonNodes()) {

if newState.Succeeded() && woc.wf.Status.IsTaskResultIncomplete(node.ID) {

Steps/DAGs don't seem to care about task results. I wonder if this is a bug in that part of the codebase?

Only nodes of pod type have task results; other nodes depend on their child nodes.

@isubasinghe (Member) commented Oct 23, 2024

Only nodes of pod type have task results; other nodes depend on their child nodes.

Sorry, I meant pods created by the templates {Step, DAG}. They do not wait until a task result is complete before creating a child. I think this is a bug, because a task could have continueOn: Failed and the child could rely upon the parent's exit code.

There is an inconsistency in the judgment logic in these two places:

Agreed, but the first is for the workflow phase. In this case you are right that it will enter this branch, but that seems harmless?
To transition to Completed, it would have had to happen previously for this branch to be hit, so all the branches would have been hit the last time the function was called.

The second one: there will be no change, since line 1144 checks if the node succeeded.

@isubasinghe (Member)

In my PR I have proposed an alternative fix: #13798; would you still prefer this?

@jswxstw (Member Author) commented Oct 23, 2024

Sorry, I meant pods created by the templates {Step, DAG}. They do not wait until a task result is complete before creating a child. I think this is a bug, because a task could have continueOn: Failed and the child could rely upon the parent's exit code.

Agreed, we may also check whether the task result is incomplete when the node is failed.

would you still prefer this?

This PR may exacerbate the above bug, but it is likely to have no impact, as nodes are rarely marked as failed directly outside of podReconciliation (this only occurs in the scenario of workflow shutdown).

@isubasinghe (Member)

but it is likely to have no impact

yeah this is what I think as well.

I do think we need to rethink the task result implementation entirely.

@isubasinghe (Member) commented Oct 25, 2024

@jswxstw I made the change here #13798 to check if the pod exists instead.

It turns out we do care about failed nodes; this was why the test was failing.

I'm not sure if you'd agree with me, but I now think that this task result should maybe even be communicated via grpc or something like that to the controller directly.

@agilgur5 commented Oct 25, 2024

I'm not sure if you'd agree with me, but I now think that this task result should maybe even be communicated via grpc or something like that to the controller directly.

Security-wise, the Executor should never need to, nor be able to, directly communicate with the Controller. The CRs like WorkflowTaskResults, WorkflowTaskSets, and WorkflowArtifactGCTasks partly exist for that purpose, as well as for (temporary) durable storage in case of an outage. The Controller is also not listening on any port.
That's a pretty fundamental part of the architecture; I definitely don't think that should be changed.

@isubasinghe (Member) commented Oct 25, 2024

Security-wise, the Executor should never need to nor be able to directly communicate with the Controller.

I don't really see the risk here though, personally. Could you elaborate on how this creates a security risk?
I mean, I don't really expect to make this change, but we want consistency.

We only have two choices here that I would be confident would fix this class of bugs:

a) communicate directly to the controller.
b) bypass the cache and hit the API directly

@agilgur5 commented Oct 25, 2024

I don't really see the risk here though, personally. Could you elaborate on how this creates a security risk?

The Controller would now be accessible via network from every namespace that Workflows runs in. Other than breaking network policies all over, this creates a whole new attack surface, as well as gives the Controller an API as an additional attack surface (and dependencies). Access to the Controller across namespaces also defeats a decent bit of the purpose of managed namespaces etc.
Anything in the cluster would be able to access that API as well, since the Executor is like any other Workflow Pod container; comparatively RBAC is used to control access to WorkflowTaskResults etc

That's a massive can of worms and fundamental shift in the architecture that I don't think should be changed.

@jswxstw (Member Author) commented Oct 25, 2024

I'm not sure if you'd agree with me, but I now think that this task result should maybe even be communicated via grpc or something like that to the controller directly.

I agree with @agilgur5. Argo, as a workflow engine for Kubernetes, is implemented based on asynchronous coordination. Transforming it into a synchronous mode is not in line with the Kubernetes style.

@isubasinghe (Member) commented Oct 25, 2024

The Controller would now be accessible via network from every namespace that Workflows runs in.

Yeah that is fair enough.

That's a massive can of worms and fundamental shift in the architecture that I don't think should be changed.

Yeah, I agree on this. I just don't see how to fix these issues with any sort of confidence. Was mostly making a statement out of frustration.

@jswxstw (Member Author) commented Oct 25, 2024

I feel that the reasons for these issues are twofold:

  1. one is that we have split the original atomic operation into two: podReconciliation and taskResultReconciliation;
  2. the other is that we did not fully consider scenarios involving the coordination order and the loss of the resources being coordinated.

After so many fixes, we have actually covered the vast majority of scenarios. If there are still any edge cases that haven't been considered, the probability of them occurring will be very low, and the risk is manageable.

@tooptoop4 (Contributor)

Too often I hear that some limitation is because of k8s. Really, the executor pods seem like the only thing that benefits from k8s, in that each running task gets the benefit of isolated compute; apart from that I don't see why k8s can be blamed for anything else. State tracking can be done with a database, just like how Airflow does it: the executor can have a wrapper that writes a 'started' row to the DB when it begins (including a pod name column), then when it's done, based on the exit code it updates the DB with DONE or FAIL. Periodically check for ungraceful termination ('started' rows where no pod exists) and update the DB to 'EVICTED'.

@agilgur5

state tracking can be done with a database

There's very little difference between a DB and using k8s resources/etcd. That also doesn't solve Pod tracking, which already uses k8s.
And similarly to my comments above about Executor -> Controller connections, Executor -> DB connections also open up a can of worms around security

@agilgur5

I feel that the reasons for these issues are twofold:

Yea, that sounds about right to me, with non-graceful termination as part of 2. As for 1, it can't be done atomically given the different Informers involved, so it needs a different form of async coordination with some assumptions baked in.

Labels: area/controller (Controller issues, panics)

Successfully merging this pull request may close these issues:

v3.5.10: Workflow stuck Running after pod runs 2min and then is deleted not gracefully