fix: refactor runConfigMapWatcher to use Informers. Fixes #11657 (#11855)
base: main
Conversation
just one small comment, but otherwise looks good!
I don't see how this fixes #11463? The RetryWatcher was already listening to
There's a bit of an issue here with the semaphore configmap notifications. See in-line below
workflow/controller/controller.go
```go
	FilterFunc: func(cm *apiv1.ConfigMap) bool {
		return cm.GetName() == wfc.configController.GetName()
	},
	Handler: cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, obj interface{}) {
```
Technically speaking, an `AddFunc` and `DeleteFunc` would be good too, though I don't think the previous RetryWatcher handled those scenarios either. They are also more complex, so I think this PR is probably good as is.
Sorry for the delay on my end, I was sick for a whole week and then playing catch up.
There are two main issues with the code below, per the in-line comments:
- There are potentially two different namespaces to watch, not just one
- There are three different types of ConfigMaps to watch, not just two
is this ok to be merged?
No, my previous review has yet to be addressed
Hey @juranir do you have any time to continue working on this next week? If not, I'd like to take over the work as #11657 has gotten to be one of the top issues (seemingly due to k8s 1.27 upgrades?).
@agilgur5 any updates on this? It's becoming very problematic.
…mt ns. Fixes argoproj#11657 Signed-off-by: Juranir Santos <[email protected]>
Thanks for picking this up again @juranir! I had to get a solid block of time to re-familiarize myself with this code, but I reviewed everything now.
For other reviewers, note that turning on "Hide Whitespace" in the GH diff is very helpful (tbh, I am thinking of using a different reviewing UI entirely given how many complex PRs I review now and how many insufficiencies GH has).
This does look much better now. Summary of my in-line comments:
- Biggest thing: this is missing the `runConfigMapWatcher` removal. That might've gotten lost in the rebase?
- I think unconditionally creating both Informers would simplify the code. If we have two Informers on the same namespace, so be it. Informer resource usage tends to be more in the cache of data and filtering, rather than the overhead of the Informer itself.
- There are some missing namespace checks.
- The label selector on the configmap informer needs moving.

While I have the context, and given the priority & impact of this, I'm just going to go ahead and make the changes requested above myself and push them to the PR to make things faster (and reduce "telephone game" interpretations). As of this review cycle I now have permissions to do that.
Signed-off-by: Anton Gilgur <[email protected]>
I have substantially simplified this now (including refactoring out some common plugin handling functions) and fixed all the correctness issues. So this should be "good enough" now.
But:
…orking on the managed ns. specify ctrl for controller instead Signed-off-by: Anton Gilgur <[email protected]>
Signed-off-by: Anton Gilgur <[email protected]>
```go
	UpdateFunc: func(_, obj interface{}) {
		cm := obj.(*apiv1.ConfigMap)
		log.Infof("Received Workflow Controller config map %s/%s update", cm.GetNamespace(), cm.GetName())
		wfc.UpdateConfig(ctx)
```
There's an optimization that can be done here as well -- we can just pass the configmap directly. `UpdateConfig` does a `Get`, which is not necessary here.

That requires a bit of refactoring, including in the config controller, and is independent of this PR (the optimization probably could have been done before this PR, in the RetryWatcher, as well), so I'm going to do that as a separate follow-up PR.
- this way the plugin etc informer is more efficient, only on configmaps with the label
- and the diff is (much) smaller (once you ignore whitespace from the inverted `if`)
- semaphore informer only needs to watch certain configmaps -- get those names from the index
  - this is pretty hacky, I don't like this, but every way to track this seemed hacky 😕
  - this option had the least changes though
- semaphore informer only needs name and namespace -- remove the rest from its cache

Signed-off-by: Anton Gilgur <[email protected]>
```go
	keys := wf.GetSemaphoreKeys()
	// store all keys while indexing as a side effect
	for _, key := range keys {
		semaphoreKeyMap[key] = struct{}{}
```
This is pretty hacky, I don't like this, but every way to track this seemed hacky 😕
This option had the least changes though and is the most efficient (as opposed to pulling out semaphore keys another time), which is why I ended up going with it.
More ideally, we might want to require a label for Semaphore ConfigMaps. That would be a breaking change though, so can't do that within this fix.
workflow/controller/controller.go
```go
	}).WithTransform(func(obj interface{}) (interface{}, error) {
		cm, ok := obj.(*apiv1.ConfigMap)
		if !ok {
			return obj, nil
		}

		// only leave name and namespace, remove the rest as we don't use it
		cm = cm.DeepCopy()
		cm.ObjectMeta = metav1.ObjectMeta{
			Name:      cm.Name,
			Namespace: cm.Namespace,
		}
		cm.Data = map[string]string{}
		cm.BinaryData = map[string][]byte{}
		return cm, nil
	})
```
Here's an external example of using a `cache.TransformFunc`, for reference.
Signed-off-by: Anton Gilgur <[email protected]>
- the errors [only occur](https://github.com/kubernetes/client-go/blob/46588f2726fa3e25b1704d6418190f424f95a990/tools/cache/shared_informer.go#L580) when the informer was either already started or stopped - and we just created it, so we know that is not the case Signed-off-by: Anton Gilgur <[email protected]>
This reverts commit b3e3274. Signed-off-by: Anton Gilgur <[email protected]>
Signed-off-by: Anton Gilgur <[email protected]>
This LGTM now -- I've pretty much done every significant optimization that is specific and isolated to this change.
`WATCH_CONTROLLER_SEMAPHORE_CONFIGMAPS` is not the most efficiently implemented (as I mentioned in the initial commit), as the Informers wouldn't need to be created at all in those cases (not just their event handlers), but given that's a workaround that could be removed (and env vars are explicitly documented as unstable), I didn't want to add too much code around it.
There's also a comment I have above on the semaphore keys hackery in the indexer, but I don't think there's a better way to do that without a breaking change (e.g. requiring labels on semaphore configmaps)
I'm also going to build a test image and push to a test registry for folks to try and test out
Pushed
```go
	if _, ok := wfc.executorPlugins[cm.GetNamespace()]; !ok {
		wfc.executorPlugins[cm.GetNamespace()] = map[string]*spec.Plugin{}
	}
	wfc.executorPlugins[cm.GetNamespace()][cm.GetName()] = p
```
Don't we need thread safety between these functions?
(if so, I see it wasn't there before)
Yea this part hasn't changed in this PR, it just got refactored into a common func. If you turn on "Hide Whitespace" on the GH diff, this line actually appears as unchanged (I mentioned the setting earlier, although it's not as useful after my refactor).

Technically speaking, they probably should be made thread-safe in case an `apply` and `delete` happen simultaneously, but realistically that will basically never happen. Also, this is run on a single goroutine, so there aren't multiple threads modifying it. Safety is good to have just in case though, but not specific to the changes in this PR.
It's run on a single goroutine? Oh, I figured the `ResourceEventHandlerFuncs` spawned new goroutines -- they don't?
Oh sorry I missed the notification for this comment; just manually checked and saw it.
I thought it would run on the same goroutine as the Informer's `.Run`, but no, I think you're correct, it does seem to spawn new goroutines. It was a little non-trivial to trace back, but this is the `client-go` line that spawns goroutines (`AddEventHandler` calls `processor.addListener`, and `wait.Group`'s `Start` spawns it).
In any case though, as this thread-safety bug had existed beforehand, I don't plan to fix it in this PR. Adding locks would also conventionally entail creating a new, thread-safe data structure for the executor plugins which becomes a sizeable change in and of itself that detracts a bit from the purpose of this PR.
There's a couple other follow-ups from this PR stemming from incidental findings as well like #11855 (comment) and #11855 (comment), so we can add this to the list 😅
Have you reviewed the other parts of this PR?
sounds good. Let me go through the rest of the PR tomorrow. :)
Signed-off-by: Anton Gilgur <[email protected]>
Unfortunately, it looks like there are some file conflicts now. Another question: which of the different configmaps have been effectively tested (either manually, unit testing, or e2e)?
Signed-off-by: Anton Gilgur <[email protected]>
It's still very necessary. The upstream issue may have been resolved in k8s 1.30 if our hypotheses are correct, but even if that is accurate, it may break again, as it was due to a breaking bugfix in etcd that is now feature-flagged due to the unexpected breakage. This change is still good to have and fixes other issues either way. But I never got around to writing the tests as you previously asked for (been busy with plenty of other RCAs, reviews, etc 🫠). As far as I could tell, there were no tests for this previously, which would explain why it wasn't caught during k8s upgrades, as well as why other ConfigMap regressions in our own code weren't caught previously.
Fixes #11657
Fixes #11463
Supersedes / Closes #11799
Motivation

After some time running, when the connection to the k8s API is broken (a timeout, for example), the watcher cannot handle the exception, increases the CPU usage, and floods the log with the message `level=error msg="invalid config map object received in config watcher. Ignored processing"`. After reaching the pod limit, it'll be restarted. See #11657 for details.

Modifications

I replaced the function with a new one using informers (many thanks @agilgur5).
I tried to use the code pattern used by other functions, but please let me know if something can be improved.
I also changed the namespace used to the managed namespace instead of the controller's namespace (this change will solve #11463).
Verification

- Manual changes in the ConfigMap are being watched and applied;
- An instance with this change has been running for a while;
- We also did some integration and performance tests.