RKE2 failing to start: fatal, Failed to apply network policy default-network-ingress-webhook-policy to namespace kube-system #5693
Comments
For reference, this is the issue tracking this problem in the Sylva project: https://gitlab.com/sylva-projects/sylva-core/-/issues/1155 |
Hello @brandond, I see you commented at #4781 (comment), which is related to this issue. It seems to me that the whole class of cases where "RKE2 startup is prevented by a webhook acting on some API operation done before kube-proxy is ready" would need to be addressed. Could that be solved by changing when kube-proxy is set up? |
RKE2 uses annotations on the system namespaces to track the state of various hardening processes that should only be performed once. Any products that deploy fail-closed webhooks that block modifications to the system namespaces are likely to break RKE2, if deployed before the hardening occurs, or during upgrades that make changes to the hardened policies. I personally think deploying fail-closed webhooks that block changes to core types, and hosting the webhook on a pod in the cluster it is protecting, is a bad idea. It is super common to end up with chicken-and-egg problems like this during a cold cluster restart, but it seems to be a recurring pattern across the ecosystem. We can evaluate changing how we track our hardening to avoid modifying the system namespaces, but this is unlikely to be changed soon. |
Rancher Server itself would I think fall in this category, right ?
This includes simple scenarios like:
My feeling here is that the central issue is that RKE2 won't start if some API actions that it wants to do trigger some fail-closed webhook. It seems to me that addressing this is needed beyond the Namespace-hardening-specific issue here, and that solving it would solve this issue among others. I don't disagree that perhaps "webhooks that block changes to core types, and hosting the webhook on a pod in the cluster it is protecting, is a bad idea", but given that this is commonplace, in particular in the Rancher/RKE2 ecosystem, isn't it worth making RKE2 more robust to this? Also, as a side note: the RKE2 hardening code apparently annotates the Namespaces simply to keep track that the network policies have been applied.
Last, today, some of those network policies will be applied even if the component that they relate to isn't enabled in RKE2 (e.g. the ingress-nginx network policies are applied even if ingress-nginx deployment by RKE2 is disabled). |
That is intentional. Once the policies are installed and the annotation added, RKE2 will not change them, so that administrators can modify them as necessary to suit their needs. The annotations can be removed to force RKE2 to re-sync the policies.
You are welcome to do that; once RKE2 has created them it will no longer modify them as long as the annotations on the NS remain in place. Like I said earlier, we can look at different ways to do this, but RKE2 has functioned like this for quite a while, and we are unlikely to refactor it on short notice. |
Of course, I understand this well, and would not ask for that. We have already implemented what is a viable short-term workaround for this issue, by ensuring that these annotations are set before RKE2 upgrade (https://gitlab.com/sylva-projects/sylva-core/-/issues/1155).
Well, as said above, this works in the short term, but for each new version of RKE2 we'll have to check whether new such annotations are necessary, and we have to maintain and test the code that ensures that this is done prior to the upgrade.
I'd prefer an approach where we could "opt out" of this: a configuration flag that stops RKE2 from handling these network policies. Or perhaps have them shipped as a Helm chart like some other base charts (e.g. the CNI). Or, for the particular case of network policies related to ingress-nginx, have them bundled in the ingress-nginx chart (so that we won't have the network policies if we set
But again, the underlying issue behind that looks more important to me: the fact that we can't have any fail-closed webhooks on any resource that RKE2 would try to touch during the early stages where kube-proxy isn't ready is seriously limiting. I of course wouldn't ask for a short-term fix on this either, but I'm interested to know what the plans are. |
matchConditions are GA in 1.30; I'd like to see folks start using those to exclude system users or groups from webhooks. |
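To make the matchConditions suggestion concrete, here is a minimal Go sketch that renders a `matchConditions` fragment for a webhook entry. It only builds JSON; it does not talk to a cluster. The group name `system:masters` and the condition name are assumptions chosen for illustration; which user or group RKE2 actually authenticates as during startup is not stated in this thread and would need to be checked.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// MatchCondition mirrors the name/expression pair of one matchConditions entry.
type MatchCondition struct {
	Name       string `json:"name"`
	Expression string `json:"expression"`
}

// excludeGroup builds a CEL expression that evaluates to false, so the
// webhook is skipped, when the requesting user belongs to the given group.
func excludeGroup(name, group string) MatchCondition {
	return MatchCondition{
		Name:       name,
		Expression: fmt.Sprintf("!(%q in request.userInfo.groups)", group),
	}
}

// fragment is the piece that would be merged into a webhook entry.
type fragment struct {
	MatchConditions []MatchCondition `json:"matchConditions"`
}

func main() {
	// "system:masters" is an illustrative assumption, not a confirmed RKE2 identity.
	f := fragment{MatchConditions: []MatchCondition{
		excludeGroup("exclude-system-group", "system:masters"),
	}}
	out, err := json.MarshalIndent(f, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

With a condition like this in place, requests from the excluded group never reach the webhook, so an unreachable webhook pod cannot block them regardless of `failurePolicy`.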
Validated on version: rke2 version v1.30.2-rc5+rke2r1 (3f678f964ad849e24449e49f0c2c44e75d944c9f)
Environment Details
Infrastructure
Node(s) CPU architecture, OS, and Version:
Cluster Configuration:
Steps to validate the fix
Reproduction Issue:
Validation Results:
|
It looks like this was backported to v1.28.11. Is there a recommended workaround solution for folks on earlier 1.28 versions? |
If possible, you can temporarily edit the webhook configuration to fail open so that rke2 can start up successfully. Once that's done you can revert it to the desired configuration. Preferably you would upgrade though. |
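As a rough illustration of the fail-open workaround, the Go sketch below generates a JSON Patch that sets `failurePolicy` on each webhook entry of a webhook configuration. The helper name `failurePolicyPatch` is hypothetical, and the number of webhook entries has to be supplied by the caller. The resulting patch could be passed to something like `kubectl patch validatingwebhookconfiguration <name> --type=json -p '<patch>'`, and the same helper with `"Fail"` would produce the patch to revert it.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// patchOp is a single RFC 6902 JSON Patch operation.
type patchOp struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value string `json:"value"`
}

// failurePolicyPatch returns a JSON Patch that replaces failurePolicy on the
// first webhookCount entries of a webhook configuration's "webhooks" list.
func failurePolicyPatch(webhookCount int, policy string) ([]byte, error) {
	ops := make([]patchOp, 0, webhookCount)
	for i := 0; i < webhookCount; i++ {
		ops = append(ops, patchOp{
			Op:    "replace",
			Path:  fmt.Sprintf("/webhooks/%d/failurePolicy", i),
			Value: policy,
		})
	}
	return json.Marshal(ops)
}

func main() {
	// "Ignore" makes the webhook fail open; "Fail" is the fail-closed setting.
	patch, err := failurePolicyPatch(1, "Ignore")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(patch))
}
```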
Context:
rancher.cattle.io.namespaces, and also a Kyverno admission webhook)
This error is produced by this part of the RKE2 code:
rke2/pkg/rke2/np.go
Lines 213 to 225 in bbda824
This code, after applying network policies for namespaces, annotates those namespaces. In the presence of webhooks that trigger on Namespace updates, this does not work at this early stage of RKE2 startup. This is due to another long-standing issue: in the early stages of RKE2 startup, kube-proxy isn't yet ready to set up connectivity to the webhook service (see #4781 (comment)).