fix: optimise pod finalizers with merge patch and resourceVersion #12862
Conversation
Improvement of #12413

For reference, that was improved in #12596, where we also suggested changing to an update, which was then completed in #12632.

The original patch used a JSON patch, which may cause an HTTP 422 error, and it is not idempotent. After #12632, it uses an update instead of a patch; the update carries a ResourceVersion, which allows k8s to calculate the diff if there are multiple changes, or otherwise return a 409 Conflict. So an update should have the same effect.

Though a merge patch might be more efficient than an update if it can provide the same guarantees.
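For illustration, here is a minimal client-go sketch of that idea (not the code from this PR; the helper name and parameters are invented): a JSON merge patch that sets the remaining finalizer list and echoes back the resourceVersion we observed, so a concurrent modification surfaces as a 409 Conflict rather than a silent overwrite. This matches the kubectl verification shown later in the thread.

```go
package sketch

import (
	"context"
	"encoding/json"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// removeOurFinalizer sends a merge patch containing the full desired finalizer
// list (our own finalizer already filtered out by the caller) plus the observed
// resourceVersion; if the Pod changed in the meantime, the apiserver rejects the
// patch with 409 Conflict instead of silently overwriting another update.
func removeOurFinalizer(ctx context.Context, kube kubernetes.Interface, namespace, name string, remainingFinalizers []string, observedResourceVersion string) error {
	patch, err := json.Marshal(map[string]interface{}{
		"metadata": map[string]interface{}{
			"finalizers":      remainingFinalizers,
			"resourceVersion": observedResourceVersion,
		},
	})
	if err != nil {
		return err
	}
	if _, err := kube.CoreV1().Pods(namespace).Patch(ctx, name, types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
		return fmt.Errorf("merge patch failed (a 409 indicates a concurrent modification): %w", err)
	}
	return nil
}
```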
Regarding removing Pod finalizers if a Workflow has been deleted, as you mentioned in #12413 (comment), I believe the more appropriate way would be to also add a finalizer on the Workflow itself. That way a Workflow cannot be deleted until it has completed all reconciliation, which would include removing Pod finalizers.
If a user does a force delete or removes the finalizer manually, then it would be on them to remove corresponding Pod finalizers (e.g. with a script, as mentioned in #12413 (comment))
The "cron task" could cause orphaned resources, which is what a finalizer is intended to prevent.
@agilgur5 I have looked through #12596 and #12632; I think using
@agilgur5 Adding a finalizer to the Workflow is indeed a viable option, but why would the "cron task" cause orphaned resources? Currently, the
Yes I mentioned in my comment that if the same guarantees can be achieved, then this would indeed be an optimization. If a
There is an inherent race condition here, particularly with very quick running Pods: it may not have added its finalizer yet. And the statement is not necessarily true either, another Controller may only be responding to certain types of Pod modifications and therefore not during "pod creation". Getting the assumptions correct is very important, especially when dealing with race conditions.
We should never be touching a finalizer that is not our own. This applies for all Controllers in k8s, not just Argo. We should also not assume how other Controllers react and definitely should not dictate how other Controllers react if Argo messes with their finalizers. Argo shouldn't touch their finalizers to begin with as they are not controlled by Argo.
Which means it can be inconsistent.
If I delete my Workflow before it has finished cleaning up, I now have orphaned resources. A finalizer is intended to prevent this scenario.
Every pod has an owner reference pointing to a workflow; if the workflow is deleted, the Kubernetes garbage collector will delete the pods automatically. It is just that the pods have finalizers on them, so they will still exist until the finalizer is removed. The "cron task" runs every 15s to remove the finalizers of pods whose owner workflow has been deleted (namely orphaned resources). The "cron task" will also remove the finalizers of normal pods (whose owner workflow still exists) when they have been terminated for a while, just in case there are a lot of workflows it can't handle in time, or the controller restarts endlessly due to bugs or something like leader election timeouts, etc. Although the "cron task" might not be the most appropriate or native way, it is a simple and reliable one. About Patch vs. Get & Update, I think we can make it configurable, so that users can make a choice based on their real circumstances.
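For illustration, a hedged sketch of the "cron task" described above (the names and wiring are invented, not the PR's implementation): every 15s it walks Pods that still carry the controller's finalizer and strips it when the owning Workflow is gone, so the garbage collector can finish its work.

```go
package sketch

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const podStatusFinalizer = "workflows.argoproj.io/status" // assumed finalizer name

// sweepOrphanedPodFinalizers ticks every 15s (the interval mentioned above) and
// strips the controller's own finalizer from Pods whose owning Workflow no longer
// exists, so the garbage collector can finally delete them.
func sweepOrphanedPodFinalizers(
	ctx context.Context,
	listPods func() []*corev1.Pod, // e.g. served from the pod informer cache
	workflowExists func(namespace, name string) bool, // e.g. served from the workflow informer cache
	removeFinalizer func(ctx context.Context, pod *corev1.Pod) error, // e.g. the merge patch sketched earlier
) {
	ticker := time.NewTicker(15 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, pod := range listPods() {
				if !hasFinalizer(pod, podStatusFinalizer) {
					continue
				}
				owner := metav1.GetControllerOf(pod)
				if owner == nil || owner.Kind != "Workflow" {
					continue
				}
				// Owning Workflow is gone: the finalizer is the only thing
				// keeping this Pod from being garbage-collected.
				if !workflowExists(pod.Namespace, owner.Name) {
					_ = removeFinalizer(ctx, pod)
				}
			}
		}
	}
}

func hasFinalizer(pod *corev1.Pod, finalizer string) bool {
	for _, f := range pod.Finalizers {
		if f == finalizer {
			return true
		}
	}
	return false
}
```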
I have found that a merge patch can also enforce the resourceVersion check if we pass it in the request. We can verify it by:

```shell
kubectl run nginx --image=nginx
kubectl get pod nginx -oyaml | grep resourceVersion
# pass the resourceVersion, and execute the command below two or more times
kubectl -v 8 patch --type=merge pod nginx -p '{"metadata":{"finalizers":[],"resourceVersion":"24472721"}}'
```

I have also increased the default pod gc workers count, lest the speed of deletion be slower than the speed of creation in case of API throttling.
…r sync cron Signed-off-by: 刘达 <[email protected]>
Hey all, sorry I haven’t had a chance to look at this yet. I’ll try to take a look later this week, but also feel free to act independently according to what you think @agilgur5. Thanks!
Signed-off-by: 刘达 <[email protected]>
Okay, just read through the thread. I'm okay with switching the …

The other thing you're saying to remove is the …

So, as far as the cron, it does seem like using a Workflow finalizer would be a more elegant solution, and I think it would enable us to remove this code which occurs when the deletion timestamp gets added, right? Unlike the cron approach, it may cause some Pods that were supposed to be deleted when they completed to exist longer than they would if, say, a Workflow doesn't get deleted for a long time, but I think we are looking at a fairly rare case, right? And we are just trying to ensure that these Pods get cleaned up eventually?
Yes, of course, using a Workflow finalizer is a more elegant way. The reason why I chose to use a cron is that I have encountered some situations like apiserver OOM and workflow leader election timeouts (because API calls took a long time), leading to the workflow-controller always restarting. After every restart, not all workflows may get reconciled before it restarts again, so in the worst case there may be some workflows that never get reconciled. In the cron, DeleteFunc is not the only trigger to remove the finalizers; it will also check the pods that finished more than a timeout duration ago and then remove the finalizer on them. This cron is executed immediately after the workflow-controller starts, and then every 15s. This is the current situation. In a finalizer-based approach, I think I need some more time to research a solution that avoids the above cases. I'm hoping for more advice.
Okay, let me try to understand the issues. So, you mention the workflow leader election timeouts. Is this due to rate limiting occurring either from the K8S side or the client side? And how is this affecting the apiserver memory? Are you essentially saying that you need the immediacy of the cron vs having to wait until the workflow's been deleted?
The apiserver was already under really high memory usage and etcd performance was poor before we upgraded; the workflow-controller restarted frequently (leader election timed out due to the bad etcd performance), so after we upgraded, it still always restarts. There are hundreds of workflows running simultaneously in our clusters, and through monitoring we observed that each workflow operation can take 3 seconds or more. So before all workflows get reconciled, the workflow-controller may restart again, meaning pod finalizers don't get removed promptly. And so, with the number of pods increasing, the apiserver got OOM-killed. That is why I treat the finalizer removal as the first-priority thing. So far, in the finalizer approach, I think the workflow queue's sort policy can be changed, making workflows with a non-zero DeletionTimestamp be popped and handled first. Even then, we still can't ensure the number of pods with finalizers will not increase continuously, so a utility checking the number of such pods may be introduced; once the number reaches some threshold, we can skip handling for a while and wait for the number to drop.
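As a rough sketch of the queue-ordering idea (illustrative only; the type and function names are made up), workflows already marked for deletion could simply be sorted ahead of the rest:

```go
package sketch

import (
	"sort"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// workflowRef is a minimal stand-in for whatever the queue actually holds.
type workflowRef struct {
	Namespace         string
	Name              string
	DeletionTimestamp *metav1.Time // non-nil once deletion has been requested
}

// prioritizeDeleted moves workflows with a non-zero DeletionTimestamp to the
// front so their Pods' finalizers can be cleaned up first after a restart.
func prioritizeDeleted(wfs []workflowRef) {
	sort.SliceStable(wfs, func(i, j int) bool {
		return wfs[i].DeletionTimestamp != nil && wfs[j].DeletionTimestamp == nil
	})
}
```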
Sorry, can you clarify why it should be a cron vs just something done at start up? (since it sounded like it was an issue caused when the Workflow Controller is down)
It's something that should be done at start up.
Okay, if that's the case, then what do you and others think about just doing this at start up time, like maybe before the pod deletion workers have started?
The key point here is to ensure finalizers can be removed in any case (e.g., the controller restarting frequently, apiserver OOM, API throttling...). Doing it at start time is a way of handling the controller restart case. This may not be elegant; as I said above, if we use Workflow finalizers to do pod finalizer removal, we can sort workflows by DeletionTimestamp, so the associated pods' finalizers can also be removed as soon as the controller starts.
So, my understanding is that if there are any issues that occur during the Workflow operation (deletion queue), there will be retrying if there's any transient error here. So, if that's the case, then we just need to handle the case of Workflows having been deleted while the Controller was down, right?
Sorry, to clarify here, I wasn't referring to the "Workflow Finalizer" method. I just meant running your current cron function once at startup, rather than on repeat. Does it make sense or no?
Let's make sure we finish the …

It shouldn't be configurable -- it should always be correct. If the patch will never delete a different finalizer, then we should use the patch, as it is more optimal -- there's no reason not to. I think that is more straightforward as well -- if it is correct and optimal, we can all agree and merge it in. The "cron task" still needs more discussion and iteration, and is altogether independent of …
Same for the "cron task", correctness first. Once that is guaranteed, we can optimize for these rarer high load scenarios.
Yea if I'm understanding this correctly, this is the unhandled edge case. "Down" including the Controller being too busy.
This makes sense to me and I agree generally speaking. This is a race condition, but yes, under high load, this makes the Controller effectively recover. I believe we should be able to get the number of Pods from the Informer cache (i.e. without another network request). Implementing this is a little non-trivial though; it is basically user-space prioritization of running goroutines. If we can implement that in a coordination-free way, then I'm all for it. Basically the "priority" of the clean-up goroutines should be higher than Workflow processing -- then we wouldn't need to implement a "skip" mechanism either.
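A minimal sketch of that Informer-cache idea (assumed lister wiring and finalizer name, not the PR's code): the count is computed entirely from the local cache, so no extra apiserver request is needed.

```go
package sketch

import (
	"k8s.io/apimachinery/pkg/labels"
	listersv1 "k8s.io/client-go/listers/core/v1"
)

const podStatusFinalizer = "workflows.argoproj.io/status" // assumed finalizer name

// countPodsWithFinalizer walks only the informer's local cache; the caller could
// throttle or re-prioritize cleanup whenever the count crosses a threshold.
func countPodsWithFinalizer(podLister listersv1.PodLister) (int, error) {
	pods, err := podLister.List(labels.Everything()) // served from the in-memory cache
	if err != nil {
		return 0, err
	}
	count := 0
	for _, pod := range pods {
		for _, f := range pod.Finalizers {
			if f == podStatusFinalizer {
				count++
				break
			}
		}
	}
	return count, nil
}
```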
Left some comments in-line below mostly with regard to the merge patch -- I did not look at the "cron task" as we're not quite agreed on the approach of that to begin with
cmd/workflow-controller/main.go (Outdated)

```diff
@@ -183,7 +183,7 @@ func NewRootCommand() *cobra.Command {
 	command.Flags().StringVar(&logFormat, "log-format", "text", "The formatter to use for logs. One of: text|json")
 	command.Flags().IntVar(&workflowWorkers, "workflow-workers", 32, "Number of workflow workers")
 	command.Flags().IntVar(&workflowTTLWorkers, "workflow-ttl-workers", 4, "Number of workflow TTL workers")
-	command.Flags().IntVar(&podCleanupWorkers, "pod-cleanup-workers", 4, "Number of pod cleanup workers")
+	command.Flags().IntVar(&podCleanupWorkers, "pod-cleanup-workers", 32, "Number of pod cleanup workers")
```
I have also increased the default pod gc workers count, lest the speed of deletion is slower than the speed of creation in case of api throttling.
Your rationale makes sense to me, but I am a bit concerned about doing this within a patch fix -- in particular from 4 -> 32 is a pretty large jump.
Also, by the same logic, we should probably increase the default of workflow-ttl-workers too.
…, remove unused code Signed-off-by: 刘达 <[email protected]>
…, remove unused code Signed-off-by: 刘达 <[email protected]>
…, optimise codes and logics Signed-off-by: 刘达 <[email protected]>
@agilgur5 Could you please review it again? I updated the code as you suggested. After this, I will create a new issue to address how to ensure finalizers are removed under various edge cases.
…, fix lint error Signed-off-by: 刘达 <[email protected]>
a few more small follow-ups
…, optimise codes Signed-off-by: 刘达 <[email protected]>
…, add nil finalaizers check Signed-off-by: 刘达 <[email protected]>
…, update slices compare method Signed-off-by: 刘达 <[email protected]>
…, fix CI test Signed-off-by: 刘达 <[email protected]>
```diff
-	if err := wfc.enablePodForDeletion(ctx, pods, p.Namespace, p.Name, false); err != nil {
-		log.WithError(err).Error("Failed to enable pod for deletion")
+	if slice.ContainsString(p.Finalizers, common.FinalizerPodStatus) {
+		wfc.queuePodForCleanup(p.Namespace, p.Name, removeFinalizer)
```
Would deletePod not make sense here, as opposed to creating a new removeFinalizer action?
It is caused by this case:

argo-workflows/test/e2e/pod_cleanup_test.go, line 120 in f0a867b:
WaitForPod(fixtures.PodCompleted)

The Workflow gets deleted, so the pod gets deleted too, so waiting for the pod to complete wouldn't happen, but I have not found who deleted it or when. About the failure in the Windows tests, I am testing it in a virtual machine, but I am struggling with network problems (see my location...). I need some help.
You can ignore the Windows tests, they're not required to pass as they're still unstable
> Would deletePod not make sense here, as opposed to creating a new removeFinalizer action?
So, should I change it back to deletePod, or keep the current solution?
Co-authored-by: Anton Gilgur <[email protected]> Signed-off-by: 刘达 <[email protected]>
Co-authored-by: Anton Gilgur <[email protected]> Signed-off-by: 刘达 <[email protected]>
Co-authored-by: Anton Gilgur <[email protected]> Signed-off-by: 刘达 <[email protected]>
Signed-off-by: 刘达 <[email protected]>
Signed-off-by: 刘达 <[email protected]>
Signed-off-by: 刘达 <[email protected]>
Merged in #13776
Improvement of #12413
Motivation
The original patch used a JSON patch, which may cause an HTTP 422 error, and it is not idempotent. It also needs a cron task to remove the finalizers of pods whose workflows were deleted while the workflow-controller was down; otherwise there may be a lot of orphaned pods.
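For illustration, a hedged client-go sketch of that failure mode (the function is hypothetical, not code from this repo): a JSON patch removes the finalizer by list index, so it is position-dependent, and a retry after the entry is already gone targets a non-existent path.

```go
package sketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// removeFinalizerByIndex removes /metadata/finalizers/<index>. The path is
// positional, so replaying the patch after the entry is gone (or after another
// controller reordered the list) can be rejected by the apiserver, typically
// with HTTP 422 Unprocessable Entity -- i.e. the operation is not idempotent.
func removeFinalizerByIndex(ctx context.Context, kube kubernetes.Interface, namespace, name string, index int) error {
	patch := []byte(fmt.Sprintf(`[{"op": "remove", "path": "/metadata/finalizers/%d"}]`, index))
	_, err := kube.CoreV1().Pods(namespace).Patch(ctx, name, types.JSONPatchType, patch, metav1.PatchOptions{})
	return err
}
```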
But using the finalizer mechanism to prevent pod deletion is still a risky method; there are some potential risks. For example, when load is high, the workflow-controller may restart and be unable to work properly, and the pod creation speed may be faster than the deletion speed (which may require changing the pod-cleanup-workers count), etc.
Modifications
Removed the getPodFromAPI() method, which may increase the load on the apiserver.
Verification
I have tested in my local environment by creating and deleting workflows, both while the workflow-controller was running and while it was not running.