Environment variables to configure (shorten) Informer ResyncPeriods #13690

Open
tooptoop4 opened this issue Oct 1, 2024 · 7 comments
Labels
area/controller Controller issues, panics area/upstream This is an issue with an upstream dependency, not Argo itself solution/workaround There's a workaround, might not be great, but exists type/feature Feature request

Comments

@tooptoop4
Contributor

tooptoop4 commented Oct 1, 2024

workflowResyncPeriod = 20 * time.Minute
is 20 minutes
podResyncPeriod = 30 * time.Minute
is 30 minutes
workflowTaskSetResyncPeriod = 20 * time.Minute
is 20 minutes
is 20 minutes

shortening might solve #13671 / #10947 (which is linked to a k8s client bug) / #12352
#1038 (comment)
#1416 (comment)
#568 (comment)
#532 (comment)
#3952

// If the pod was deleted, then it is possible that the controller never get another informer message about it.
// In this case, the workflow will only be requeued after the resync period (20m). This means
// workflow will not update for 20m. Requeuing here prevents that happening.

#4423

@tooptoop4 tooptoop4 added the type/feature Feature request label Oct 1, 2024
@agilgur5

agilgur5 commented Oct 1, 2024

> shortening might solve #13671 / #10947 (which is linked to a k8s client bug)

That would be a workaround, not a solution. Cache rebuilds are expensive, especially if you have a large number of Workflows. We leave it at the k8s default, so since it's not tuned in Argo itself, making it user-configurable is a bit confusing, to say the least.

There's also one of these for every informer

Also please fill out the issue templates in full, especially if you want to be a good role model to others.

@tooptoop4
Contributor Author

tooptoop4 commented Oct 1, 2024

@agilgur5 can you clarify "expensive" in what terms? (k8s API calls, controller CPU/memory, something else?) That might be preferable to missing SLAs for me

From reading kubernetes/kubernetes#127964 and kubernetes/client-go#571, the informer seems unreliable compared to listing the current state.

So the choice seems to be between relying on events/cache to decide which workflows should be operated on (non-zero chance of missing some) and simply listing all workflows (guaranteed to have all of them).
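To make the trade-off concrete, here is a minimal sketch of the "list everything and compare" side of that choice: given the keys returned by a fresh list and the keys currently held in a cache, it reports which ones the cache never saw. The function name, the plain `namespace/name` string keys, and the sample data are all hypothetical; a real controller would obtain both sides from the API server and the informer store rather than from literals.

```go
package main

import (
	"fmt"
	"sort"
)

// missingFromCache returns, in sorted order, the keys that appear in a fresh
// list but are absent from the cache. Hypothetical helper for illustration.
func missingFromCache(listed []string, cached map[string]struct{}) []string {
	var missing []string
	for _, key := range listed {
		if _, ok := cached[key]; !ok {
			missing = append(missing, key)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Sample data only: the cache has missed the event for wf-c.
	cached := map[string]struct{}{"argo/wf-a": {}, "argo/wf-b": {}}
	listed := []string{"argo/wf-a", "argo/wf-b", "argo/wf-c"}
	fmt.Println(missingFromCache(listed, cached))
}
```

The comparison itself is cheap; the cost discussed in this thread comes from producing the fresh list in the first place.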

@tooptoop4 tooptoop4 changed the title Environment variable to configure (shorten) workflowResyncPeriod Environment variable to configure (shorten) workflowResyncPeriod/podResyncPeriod Oct 1, 2024
@agilgur5

agilgur5 commented Oct 1, 2024

All of the above. It can do a full relist, which is expensive in k8s API calls and network I/O, and it iterates through the entire cache, which uses CPU and memory. Depending on your usage, you might be able to see the rebuild as a clear spike in your metrics, as with #12206 (comment)

In #12125 (comment) (I forgot that issue existed, very similar) and #13466 (comment) I linked to some readings upstream in kubernetes-client/java#725 (comment), this k8s SIG API Machinery Google Group thread, and argoproj/gitops-engine#617 (comment). According to those, Informers are supposed to be quite stable now and no longer relist, although it's unclear whether that applies outside of "core controllers".
But core controllers, kubebuilder, controller-runtime, etc. all make heavy use of Informers, so they're an essential piece of k8s controllers upstream, and not necessarily something Argo should be working around if there are bugs.

I would say whether that even makes sense to expose to users is more of an upstream question, since it seems like k8s maintainers don't recommend changing the default for other tooling either.

> that might be preferable to missing SLAs for me

That's a bit of a different question, one that is potentially worth exposing in its own right. The argument against it would be that if Informers are acting up, your entire cluster is going to be having problems, not just Argo.

@agilgur5 agilgur5 changed the title Environment variable to configure (shorten) workflowResyncPeriod/podResyncPeriod Environment variables to configure (shorten) Informer ResyncPeriods Oct 1, 2024
@agilgur5 agilgur5 added solution/workaround There's a workaround, might not be great, but exists area/controller Controller issues, panics area/upstream This is an issue with an upstream dependency, not Argo itself labels Oct 1, 2024
@tooptoop4 tooptoop4 mentioned this issue Oct 2, 2024
4 tasks
@agilgur5 agilgur5 added the problem/more information needed Not enough information has been provide to diagnose this issue. label Oct 7, 2024
Contributor

This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.

@github-actions github-actions bot added the problem/stale This has not had a response in some time label Oct 24, 2024
@tooptoop4
Contributor Author

/unrotten

@agilgur5

/unrotten

This is still missing information...

@github-actions github-actions bot removed problem/stale This has not had a response in some time problem/more information needed Not enough information has been provide to diagnose this issue. labels Oct 26, 2024
@tooptoop4
Contributor Author

According to kubernetes/kubernetes#128183 (comment), this is not an upstream issue.
