fix: optimise pod finalizers with merge patch and resourceVersion #12862

Closed
wants to merge 24 commits

Conversation

@imliuda (Contributor) commented Mar 30, 2024

Improvement of #12413

Motivation

The original implementation uses a JSON patch, which can fail with an HTTP 422 error and is not idempotent. It also needs a cron task to remove the finalizers of pods whose workflow was deleted while the workflow-controller was down; otherwise there may be a lot of orphaned pods.

However, using the finalizer mechanism to prevent pod deletion is still risky. For example, under high load the workflow-controller may keep restarting and stop working properly, and pods may be created faster than they are deleted (which may require changing the pod-cleanup-workers count), etc.

Modifications

  1. Change the patch type from JSON patch to merge patch (a minimal sketch follows this list).
  2. Add a cron task that removes pod finalizers once the owning workflow has been deleted, or after a timeout.
  3. Reuse the pod cleanup queue to delete pods.
  4. Remove the getPodFromAPI() method, which adds load on the apiserver.
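
As a rough illustration of item 1, the sketch below removes only the controller's own finalizer with a JSON merge patch and embeds the pod's resourceVersion as an optimistic-concurrency precondition. It is not the PR's actual code; the function and the finalizerName parameter are illustrative stand-ins for the controller's helpers and constant.

```go
package sketch

import (
	"context"
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	corev1client "k8s.io/client-go/kubernetes/typed/core/v1"
)

// removeOwnFinalizer strips only the given finalizer from the pod via a JSON
// merge patch. Including metadata.resourceVersion makes the apiserver reject
// the patch with 409 Conflict if the pod changed since it was read, so the
// patch cannot silently drop a finalizer that another controller added in the
// meantime.
func removeOwnFinalizer(ctx context.Context, pods corev1client.PodInterface, pod *corev1.Pod, finalizerName string) error {
	kept := make([]string, 0, len(pod.Finalizers))
	for _, f := range pod.Finalizers {
		if f != finalizerName {
			kept = append(kept, f)
		}
	}
	body, err := json.Marshal(map[string]interface{}{
		"metadata": map[string]interface{}{
			"finalizers":      kept,
			"resourceVersion": pod.ResourceVersion,
		},
	})
	if err != nil {
		return err
	}
	_, err = pods.Patch(ctx, pod.Name, types.MergePatchType, body, metav1.PatchOptions{})
	return err
}
```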

Verification

I have tested this in my local environment by creating and deleting workflows both while the workflow-controller was running and while it was stopped.

@agilgur5 left a comment


Improvement of #12413

For reference, that was improved in #12596, where we also suggested changing to an update, which was then completed in #12632

The original implementation uses a JSON patch, which can fail with an HTTP 422 error and is not idempotent.

After #12632, it uses an update instead of a patch; the update carries a ResourceVersion, which allows k8s to calculate the diff if there are multiple changes, or otherwise return a 409 conflict. So an update should have the same effect.

Though a merge patch might be more efficient than an update if it can provide the same guarantees
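
For comparison, here is a minimal sketch of the update-based pattern described above, using client-go's RetryOnConflict to absorb 409s. It only illustrates the general approach, not the code from #12632; finalizerName is a placeholder.

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// removeFinalizerWithUpdate re-reads the pod and writes it back with the
// finalizer filtered out. The Update carries the pod's ResourceVersion, so a
// concurrent write makes the apiserver return 409 Conflict, which
// RetryOnConflict retries with a fresh read.
func removeFinalizerWithUpdate(ctx context.Context, client kubernetes.Interface, namespace, name, finalizerName string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		pod, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		kept := make([]string, 0, len(pod.Finalizers))
		for _, f := range pod.Finalizers {
			if f != finalizerName {
				kept = append(kept, f)
			}
		}
		pod.Finalizers = kept
		_, err = client.CoreV1().Pods(namespace).Update(ctx, pod, metav1.UpdateOptions{})
		return err
	})
}
```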


Regarding removing Pod finalizers if a Workflow has been deleted, as you mentioned in #12413 (comment), I believe the more appropriate way would be to also add a finalizer on the Workflow itself. That way a Workflow cannot be deleted until it has completed all reconciliation, which would include removing Pod finalizers.
If a user does a force delete or removes the finalizer manually, then it would be on them to remove corresponding Pod finalizers (e.g. with a script, as mentioned in #12413 (comment))

The "cron task" could cause orphaned resources, which is what a finalizer is intended to prevent.

@agilgur5 agilgur5 requested review from Joibel and juliev0 March 30, 2024 16:40
@agilgur5 agilgur5 changed the title fix: optimise pod finalizers by using merge patch, and add a finalize… fix: optimise pod finalizers by using merge patch, and add a finalizer sync cron Mar 30, 2024
@agilgur5 agilgur5 changed the title fix: optimise pod finalizers by using merge patch, and add a finalizer sync cron fix: optimise pod finalizers with merge patch + add finalizer sync cron Mar 30, 2024
@agilgur5 agilgur5 added the area/controller Controller issues, panics label Mar 30, 2024
@imliuda (Contributor, Author) commented Mar 31, 2024

@agilgur5 I have looked at #12596 and #12632. I don't think using Get() and Update() for each pod is a good approach: e.g., if the pod creation rate is 10/s and each pod's size is 200KB (pods usually have very big environment variables), then the additional bandwidth will be 4MB/s, which is a waste. @juliev0 what do you think about this? If there really is another controller that adds a finalizer to Argo pods, it should have been added at pod creation; and if that other controller has already removed its own finalizer and our Patch adds it back, the other controller could just remove it one more time.

The "cron task" could cause orphaned resources, which is what a finalizer is intended to prevent.

@agilgur5 Adding a finalizer to the workflow is indeed an option, but why would the "cron task" cause orphaned resources? Currently, syncPodFinalizer() is just a simple way to ensure pod finalizers can be removed; it is more like a supervisor, and it doesn't care about workflow statuses, the controller's health status, etc.

@agilgur5 commented Mar 31, 2024

then the additional bandwidth will be 4MB/s, which is a waste.

Yes I mentioned in my comment that if the same guarantees can be achieved, then this would indeed be an optimization. If a patch can be done without creating conflicts, that would be more optimal.

If there really is another controller that adds a finalizer to Argo pods, it should have been added at pod creation

There is an inherent race condition here, particularly with very quick running Pods: it may not have added its finalizer yet.

And the statement is not necessarily true either, another Controller may only be responding to certain types of Pod modifications and therefore not during "pod creation".

Getting the assumptions correct is very important, especially when dealing with race conditions.

the other controller could just remove it one more time.

We should never be touching a finalizer that is not our own. This applies for all Controllers in k8s, not just Argo. We should also not assume how other Controllers react and definitely should not dictate how other Controllers react if Argo messes with their finalizers. Argo shouldn't touch their finalizers to begin with as they are not controlled by Argo.
This behavior would violate correctness as well as separation of duties.

it doesn't care about workflow statuses, the controller's health status, etc.

Which means it can be inconsistent.

but why "cron task" could cause orphaned resources?

If I delete my Workflow before it has finished cleaning up, I now have orphaned resources. A finalizer is intended to prevent this scenario.

@imliuda (Contributor, Author) commented Apr 1, 2024

If I delete my Workflow before it has finished cleaning up, I now have orphaned resources. A finalizer is intended to prevent this scenario.

Every pod has an owner reference pointing to a workflow; if the workflow is deleted, the Kubernetes garbage collector deletes the pods automatically. It is just that the pods have finalizers on them, so they will still exist until the finalizer is removed. The "cron task" runs every 15s to remove the finalizers of pods whose owner workflow has been deleted (i.e. the orphaned resources).

The "cron task" will also remove the finalizers of normal pods (whose owner workflow still exists) once they have been terminated for a while, just in case there are more workflows than the controller can handle in time, or the controller restarts endlessly due to bugs or something like a leader election timeout, etc.

Although the "cron task" might not be the most idiomatic approach, it is simple and reliable.

Regarding Patch vs. Get & Update, I think we can make it configurable, so users can choose based on their actual circumstances.
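
A rough sketch of the sweep described above, assuming hypothetical callbacks in place of the real informer lookups and pod-cleanup queue:

```go
package sketch

import (
	"slices"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// sweepOrphanedPodFinalizers shows the shape of the periodic check: for every
// pod that still carries our finalizer, if its owning Workflow no longer
// exists, queue the finalizer for removal. workflowExists and
// queueFinalizerRemoval are hypothetical callbacks standing in for the
// controller's informer lookup and pod-cleanup queue.
func sweepOrphanedPodFinalizers(
	pods []*corev1.Pod,
	finalizerName string,
	workflowExists func(namespace, name string) bool,
	queueFinalizerRemoval func(namespace, name string),
) {
	for _, pod := range pods {
		if !slices.Contains(pod.Finalizers, finalizerName) {
			continue
		}
		owner := metav1.GetControllerOf(pod)
		if owner == nil || owner.Kind != "Workflow" {
			continue
		}
		// Once the owning Workflow is gone, only the finalizer keeps the pod
		// around, so it is safe to queue the finalizer for removal.
		if !workflowExists(pod.Namespace, owner.Name) {
			queueFinalizerRemoval(pod.Namespace, pod.Name)
		}
	}
}
```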

@imliuda (Contributor, Author) commented Apr 4, 2024

I have found that a merge patch can also respect the resourceVersion check if we include it in the request. We can verify it by:

kubectl run nginx --image=nginx
kubectl get pod nginx -oyaml | grep resourceVersion
# pass the resourceVersion, execute the command below twice or more times
kubectl -v 8 patch --type=merge pod  nginx -p '{"metadata":{"finalizers":[],"resourceVersion":"24472721"}}'
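
In controller code, the same precondition shows up as a 409 Conflict when the resourceVersion is stale. A small sketch of detecting that (illustrative only, not the PR's implementation):

```go
package sketch

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	corev1client "k8s.io/client-go/kubernetes/typed/core/v1"
)

// tryMergePatch sends a merge patch whose body embeds a resourceVersion, as in
// the kubectl demo above. It reports whether the apiserver rejected it with
// 409 Conflict, so the caller can re-read the pod and requeue instead of
// overwriting a concurrent change.
func tryMergePatch(ctx context.Context, pods corev1client.PodInterface, name string, body []byte) (conflicted bool, err error) {
	_, err = pods.Patch(ctx, name, types.MergePatchType, body, metav1.PatchOptions{})
	if apierrors.IsConflict(err) {
		return true, nil
	}
	return false, err
}
```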

I have also increased the default pod GC worker count, so that deletion does not fall behind creation when the API is being throttled.

@juliev0 (Contributor) commented Apr 4, 2024

Hey all, sorry I haven’t had a chance to look at this yet. I’ll try to take a look later this week, but also feel free to act independently according to what you think @agilgur5. Thanks!

@juliev0 (Contributor) commented Apr 5, 2024

Okay, just read through the thread. I'm okay with switching the Update() with a merge patch to reduce transmitted bytes.

The other thing you're saying to remove is the Get() which is made as part of the getPodFromAPI() call. Want to check with @sakai-ast (if he remembers), as I know there was some behavior seen that caused adding this.

So, as far as the cron, it does seem like using a Workflow finalizer would be a more elegant solution, and I think would enable us to remove this code which occurs when the deletion timestamp gets added, right? Unlike the cron approach, it may cause some Pods that were supposed to be deleted when they completed to exist longer than they would if say a Workflow doesn't get deleted for a long time, but I think we are looking at a fairly rare case, right? and are just trying to ensure that these Pods get cleaned up eventually?

@imliuda (Contributor, Author) commented Apr 6, 2024

Yes, of course, using a workflow finalizer is a more elegant way. The reason I chose a cron is that I have encountered situations like apiserver OOM and workflow leader election timeouts (because API calls take a long time), causing the workflow-controller to restart constantly. After each restart, not all workflows may get reconciled before it restarts again. So in the worst case, some workflows may never get reconciled.

In the cron, DeleteFunc is not the only trigger for removing finalizers; it also checks for pods that finished more than a timeout duration ago and removes their finalizers. The cron is executed immediately after the workflow-controller starts, and then every 15s. This is the current situation.

If we go the finalizer route, I think I need some more time to research a solution that avoids the cases above. Any further advice would be appreciated.

@juliev0 (Contributor) commented Apr 6, 2024

Yes, of course, using a workflow finalizer is a more elegant way. The reason I chose a cron is that I have encountered situations like apiserver OOM and workflow leader election timeouts (because API calls take a long time), causing the workflow-controller to restart constantly. After each restart, not all workflows may get reconciled before it restarts again. So in the worst case, some workflows may never get reconciled.

Okay, let me try to understand the issues. So, you mention the workflow leader election timeouts. Is this due to rate limiting occurring either from the K8S side or the client side? And how is this affecting the apiserver memory?

Are you essentially saying that you need the immediacy of the cron vs having to wait until the workflow's been deleted?

@imliuda (Contributor, Author) commented Apr 7, 2024

The apiserver already had very high memory usage and poor etcd performance before we upgraded, and the workflow-controller was restarting frequently (leader election timeouts due to bad etcd performance), so after the upgrade it still kept restarting. There are hundreds of workflows running simultaneously in our clusters, and from our monitoring we observed that each workflow operation can take 3 seconds or more. So before all workflows get reconciled, the workflow-controller may restart again, and pod finalizers are not removed in time. As the number of pods keeps increasing, the apiserver eventually gets OOM-killed.

So I treat finalizer removal as the first-priority thing. In the finalizer approach, I think the workflow queue's sort policy can be changed so that workflows with a non-zero DeletionTimestamp are popped and handled first. Even then, we still can't ensure that the number of pods with finalizers will not grow continuously, so a check on the number of such pods may be introduced: once the number reaches some threshold, we can skip handling for a while and wait for the number to drop.

@juliev0 (Contributor) commented Apr 7, 2024

Sorry, can you clarify why it should be a cron vs just something done at start up? (since it sounded like it was an issue caused when the Workflow Controller is down)

@imliuda (Contributor, Author) commented Apr 7, 2024

It's something that should be done at start-up.

@juliev0 (Contributor) commented Apr 7, 2024

Okay, if that's the case, then what do you and others think about just doing this at start up time, like maybe before the pod deletion workers have started?

@imliuda (Contributor, Author) commented Apr 7, 2024

The key point here is to ensure finalizers can be removed in any case (e.g. the controller restarting frequently, apiserver OOM, API throttling, ...).

Doing it at start time is one way of handling the controller-restart case. It may not be elegant; as I said above, if we use workflow finalizers to drive pod finalizer removal, we can sort workflows by DeletionTimestamp, so the associated pods' finalizers are also removed as soon as the controller starts.

@juliev0 (Contributor) commented Apr 7, 2024

The key point here is to ensure finalizers can be removed in any case (e.g. the controller restarting frequently, apiserver OOM, API throttling, ...).

So, my understanding is that if there are any issues that occur during the Workflow operation (deletion queue), there will be retrying if there's any transient error here. So, if that's the case, then we just need to handle the case of Workflows having been deleted while the Controller was down, right?

Doing it at start time is one way of handling the controller-restart case. It may not be elegant; as I said above, if we use workflow finalizers to drive pod finalizer removal, we can sort workflows by DeletionTimestamp, so the associated pods' finalizers are also removed as soon as the controller starts.

Sorry, to clarify here, I wasn't referring to the "Workflow Finalizer" method. I just meant running your current cron function once at startup, rather than on repeat. Does it make sense or no?

@agilgur5 commented Apr 8, 2024

Regarding Patch vs. Get & Update, I think we can make it configurable, so users can choose based on their actual circumstances.

Okay, just read through the thread. I'm okay with switching the Update() with a merge patch to reduce transmitted bytes.

Let's make sure we finish the Patch vs Update discussion.

It shouldn't be configurable -- it should always be correct. If the patch will never delete a different finalizer, then we should use the patch, as it is more optimal -- there's no reason not to.
If it's not correct, we shouldn't be doing it. Again, fundamentally, to make any assumptions and guarantees, you need to ensure correctness first.

I think that is more straightforward as well -- if it is correct and optimal, we can all agree and merge it in.

The "cron task" still has more discussion and iteration, and is altogether independent of Patch vs. Update, so it probably makes sense to split that off into a separate PR (and there's already a title with two subjects in the PR, which is often an indicator of mixed concerns).

but I think we are looking at a fairly rare case, right?

Are you essentially saying that you need the immediacy of the cron vs having to wait until the workflow's been deleted?

Same for the "cron task", correctness first. Once that is guaranteed, we can optimize for these rarer high load scenarios.
As an example, the situation the pod finalizer was added for is a lot more common than this.

So, if that's the case, then we just need to handle the case of Workflows having been deleted while the Controller was down, right?

Yea if I'm understanding this correctly, this is the unhandled edge case. "Down" including the Controller being too busy.
A workflow finalizer would prevent this scenario.

So I treat finalizer removal as the first-priority thing. In the finalizer approach, I think the workflow queue's sort policy can be changed so that workflows with a non-zero DeletionTimestamp are popped and handled first. Even then, we still can't ensure that the number of pods with finalizers will not grow continuously, so a check on the number of such pods may be introduced: once the number reaches some threshold, we can skip handling for a while and wait for the number to drop.

This makes sense to me and I agree generally speaking. This is a race condition, but yes, under high load, this makes the Controller effectively recover. I believe we should be able to get the number of Pods from the Informer cache (i.e. without another network request).

Implementing this is a little non-trivial though, this is basically user-space prioritization of running goroutines. If we can implement that coordination-free way, then I'm all for it. Basically the "priority" of clean up goroutines should be higher than Workflow processing -- then we wouldn't need to implement a "skip" mechanism either.
Goroutines don't have priorities though. The very naive way would be to just have more clean up workers. An approach that uses a limited amount of coordination would be for the workflow workers to wait when there's too many pods that need clean up. I actually can't think of a coordination free approach off the top of my head without priorities 🤔
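
One possible shape of that limited-coordination idea, sketched with an arbitrary threshold and polling interval (hypothetical, not part of the controller):

```go
package sketch

import (
	"context"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// waitForCleanupBacklog pauses a workflow worker while the pod-cleanup queue
// is too deep, so cleanup workers effectively get priority under load. The
// threshold and polling interval are illustrative values only.
func waitForCleanupBacklog(ctx context.Context, podCleanupQueue workqueue.RateLimitingInterface, maxBacklog int) {
	for podCleanupQueue.Len() > maxBacklog {
		select {
		case <-ctx.Done():
			return
		case <-time.After(time.Second):
		}
	}
}
```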

@agilgur5 left a comment

Left some comments in-line below mostly with regard to the merge patch -- I did not look at the "cron task" as we're not quite agreed on the approach of that to begin with

@@ -183,7 +183,7 @@ func NewRootCommand() *cobra.Command {
 	command.Flags().StringVar(&logFormat, "log-format", "text", "The formatter to use for logs. One of: text|json")
 	command.Flags().IntVar(&workflowWorkers, "workflow-workers", 32, "Number of workflow workers")
 	command.Flags().IntVar(&workflowTTLWorkers, "workflow-ttl-workers", 4, "Number of workflow TTL workers")
-	command.Flags().IntVar(&podCleanupWorkers, "pod-cleanup-workers", 4, "Number of pod cleanup workers")
+	command.Flags().IntVar(&podCleanupWorkers, "pod-cleanup-workers", 32, "Number of pod cleanup workers")

I have also increased the default pod GC worker count, so that deletion does not fall behind creation when the API is being throttled.

Your rationale makes sense to me, but I am a bit concerned about doing this within a patch fix -- in particular from 4 -> 32 is a pretty large jump.

Also by the same logic we should probably increase the default of workflow-ttl-workers too

@agilgur5 agilgur5 added this to the v3.6.0 milestone Apr 12, 2024
@imliuda (Contributor, Author) commented Apr 12, 2024

@agilgur5 Could you please review it again? I have updated the code as you suggested. After this, I will create a new issue to address how to ensure finalizers are removed under the various edge cases.

@agilgur5 left a comment

a few more small follow-ups

@agilgur5 commented

Also two E2Es are failing. I haven't seen those flake before, although I did have a bit of a similar problem locally, not sure if it's related to your changes or not

@imliuda (Contributor, Author) commented Apr 14, 2024

Also two E2Es are failing. I haven't seen those flake before, although I did have a bit of a similar problem locally, not sure if it's related to your changes or not

I'm trying to find it in my local environment.

if err := wfc.enablePodForDeletion(ctx, pods, p.Namespace, p.Name, false); err != nil {
log.WithError(err).Error("Failed to enable pod for deletion")
if slice.ContainsString(p.Finalizers, common.FinalizerPodStatus) {
wfc.queuePodForCleanup(p.Namespace, p.Name, removeFinalizer)


would deletePod not make sense here? as opposed to creating a new removeFinalizer action?

@imliuda (Contributor, Author) commented Apr 17, 2024


It is caused by this case:

WaitForPod(fixtures.PodCompleted)

The workflow gets deleted, so the pod gets deleted too, so waiting for the pod to complete never happens, but I have not found who deleted it or when. About the failure in the Windows tests, I am testing it in a virtual machine, but I am struggling with network problems (see my location...). I need some help.


You can ignore the Windows tests, they're not required to pass as they're still unstable

@imliuda (Contributor, Author) commented

would deletePod not make sense here? as opposed to creating a new removeFinalizer action?

So, should I change it back to deletePod, or keep the current solution?

Co-authored-by: Anton Gilgur <[email protected]>
Signed-off-by: 刘达 <[email protected]>
@Joibel (Member) commented Oct 22, 2024

Merged in #13776

@Joibel Joibel closed this Oct 22, 2024
@agilgur5 agilgur5 added the solution/superseded This PR or issue has been superseded by another one (slightly different from a duplicate) label Oct 22, 2024
@agilgur5 commented

Merged in #13776

@Joibel, as an Approver, you can usually push to existing PRs (unless the contributor has unticked the box) and so don't necessarily need to create a superseding PR

Labels: area/controller, solution/superseded
5 participants