
feat: first pass at MHC support #77

Merged 2 commits into k3s-io:main on Dec 7, 2023

Conversation

lukebond
Contributor

lukebond commented Nov 22, 2023

part of #63

i'm not convinced this is ready, but i have questions, so i'm using this PR to get them answered and move it towards readiness.

this PR takes the remediation code from the kubeadm controlplane provider and brings it into the k3s controlplane provider, with a few changes necessitated by the differences between the two and by things the k3s controlplane provider is missing due to being less mature. i'll try to outline those differences in this PR, as well as explain why i think this is more or less good to go, pending a few questions.

the cluster-api healthchecking guidelines

see docs.

  • Only Machines owned by a MachineSet or a KubeadmControlPlane can be remediated by a MachineHealthCheck (since a MachineDeployment uses a MachineSet, then this includes Machines that are part of a MachineDeployment) - this code works with machines owned by the k3s control plane, using the existing structures used elsewhere in the codebase that contain a list of controlplane machines ✅
  • Machines managed by a KubeadmControlPlane are remediated according to the delete-and-recreate guidelines described in the KubeadmControlPlane proposal - that's what the code i brought over from the kubeadm provider does ✅
  • in order to remediate a CP machine, either the cluster must not yet be initialised or it must have at least two CP machines. this is handled by the if controlPlane.KCP.Status.Initialized { block, which does some checks that can bail out of the remediation; those checks don't run if the CP isn't initialised, so remediation can continue (the sketch after this list shows how these checks fit together). as for the number of CP machines, i've brought across the canSafelyRemoveEtcdMember func (with some changes, discussed below), which does a quorum check. this dot point is hardly necessary since the quorum issue is dealt with elsewhere. ✅
  • Previous remediation (delete and re-create) MUST have been completed. This rule prevents KCP from remediating more machines while the replacement for the previous machine is not yet created. - this is handled in a few places: first, a check against an annotation indicating that a remediation is in progress but its replacement node hasn't come up yet; elsewhere there is code to allow a remediation to retry according to a configured policy. again, all taken from the kubeadm implementation. ✅
  • The cluster MUST have no machines with a deletion timestamp. This rule prevents KCP taking actions while the cluster is in a transitional state. - this is handled by a check inside the "if initialized" block to ensure there are no CP machines that are being deleted ✅
  • Remediation MUST preserve etcd quorum - this is satisfied by the quorum check described above. ✅
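
to make that gating concrete, here's a minimal sketch of the flow in go. the types and names (controlPlaneState, shouldRemediate) are made-up stand-ins, not the provider's actual structures, so read it as an illustration of the rules rather than the merged code:

```go
package remediation

// controlPlaneState is a hypothetical stand-in for the provider's view of the
// control plane; the real code reads these facts from the control plane object,
// its machines, and an in-progress-remediation annotation.
type controlPlaneState struct {
	Initialized            bool // the cluster has been initialised
	RemediationInProgress  bool // a previous delete-and-recreate hasn't completed yet
	HasDeletingMachines    bool // some CP machine already has a deletion timestamp
	QuorumSafeAfterRemoval bool // result of the etcd quorum check
}

// shouldRemediate applies the guidelines listed above: if the control plane
// isn't initialised yet, remediation proceeds without further checks;
// otherwise every rule must hold before the unhealthy machine is deleted
// (its replacement is created by the regular reconcile).
func shouldRemediate(cp controlPlaneState) bool {
	if !cp.Initialized {
		return true
	}
	if cp.RemediationInProgress {
		return false // previous remediation must finish first
	}
	if cp.HasDeletingMachines {
		return false // cluster is in a transitional state
	}
	return cp.QuorumSafeAfterRemoval
}
```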

limitations

mostly the code came across as-is and is largely copy-paste. i'm no open-source licensing expert, but my understanding is that it's fine to copy-paste this code from the kubeadm provider, especially since there is already plenty of such copy-pasted code, some still commented out, in the k3s provider codebase.

the main changes involved swapping the imports and data structures used, e.g. things like controlPlane.GetWorkloadCluster(ctx), which this provider doesn't have.

the main functional difference here, which is also a limitation and something i want to ask about, is in the canSafelyRemoveEtcdMember() function. the kubeadm version gets a list of etcd members by name, which it obtains by querying etcd, and stores that list in the workload cluster struct. the k3s provider doesn't store such a list and indeed sidesteps talking to etcd at all. there is a comment in workload_cluster.go that says // TODO: figure out etcd at the place where the kubeadm provider goes on to ensure that the node selected is not the etcd leader; meaning, i think, that if the k3s provider did a scale-down operation there is a chance it could choose the etcd leader. given we don't have that code, what i instead do in that function is assume that every CP node is an etcd node. i imagine that's fine for users today, but i'm not sure it's enough long-term.
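
to make that assumption explicit, here's a simplified sketch of the quorum arithmetic with every CP machine treated as an etcd member. the int-based signature is invented for illustration; the real function works on machine collections, not plain counts:

```go
package remediation

// canSafelyRemoveEtcdMember sketches the quorum check under the assumption
// that every control plane machine runs an etcd member. totalCPMachines is
// the current number of CP machines (including the one being remediated),
// unhealthyCPMachines the number currently unhealthy (also including it).
func canSafelyRemoveEtcdMember(totalCPMachines, unhealthyCPMachines int) bool {
	// After deleting the machine being remediated, one member is gone.
	targetTotalMembers := totalCPMachines - 1
	// etcd quorum for N members is floor(N/2) + 1.
	targetQuorum := (targetTotalMembers / 2) + 1
	// The machine being removed no longer counts among the unhealthy ones.
	targetUnhealthy := unhealthyCPMachines - 1
	if targetUnhealthy < 0 {
		targetUnhealthy = 0
	}
	// Remediation is safe only if the remaining healthy members still form a quorum.
	return targetTotalMembers-targetUnhealthy >= targetQuorum
}
```

with a 3-machine control plane and one unhealthy machine this returns true (2 of the 2 remaining members are healthy, quorum is 2); with 2 of 3 unhealthy it returns false.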

so, a question for the maintainers: do we want this addressed before we bring in CP node remediation? should i bring in an etcd client, properly fetch the list of members, and go on to fetch their machines, just like the kubeadm provider does?

tests

you have only the code and my word to go on, since i've not brought over the remediation tests from kubeadm. doing so as part of this PR would have been a big job: they aren't simply remediation unit tests, they use bits and pieces of testing helper and util code that i'd need to bring over as well, which felt like a bigger job than the actual remediation change itself.

some options i think we have here:

  • leave it as-is and assume that, since the code is tested in kubeadm, it should work
  • strip out a subset of tests from the kubeadm remediation code that are less dependent on the helper code, so at least we have something to start with, though coverage would likely be poor (something like the minimal sketch after this list)
  • do a separate testing PR to bring over the supporting bits and then retrospectively bring over the remediation tests, which would then be easier
  • just bring over all the testing support code used throughout the kubeadm provider in this PR and then bring over just the remediation tests, laying the groundwork for bringing over the rest later
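
as an example of the second option, a first test could be as small as a table-driven check of the quorum sketch above (assuming it lives in the same hypothetical remediation package), long before any of the kubeadm test helpers come across:

```go
package remediation

import "testing"

// TestCanSafelyRemoveEtcdMember exercises only the quorum arithmetic from the
// sketch above; it is a placeholder for the much broader kubeadm remediation tests.
func TestCanSafelyRemoveEtcdMember(t *testing.T) {
	cases := []struct {
		name      string
		total     int
		unhealthy int
		want      bool
	}{
		{"3 CP machines, 1 unhealthy", 3, 1, true},
		{"3 CP machines, 2 unhealthy", 3, 2, false},
		{"5 CP machines, 1 unhealthy", 5, 1, true},
		{"5 CP machines, 3 unhealthy", 5, 3, false},
	}
	for _, tc := range cases {
		if got := canSafelyRemoveEtcdMember(tc.total, tc.unhealthy); got != tc.want {
			t.Errorf("%s: got %v, want %v", tc.name, got, tc.want)
		}
	}
}
```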

on one hand it would seem harsh if this PR were held to a higher standard of testing than all the work done to date, but on the other hand this is controlplane node remediation we're talking about!

@mogliang
Collaborator

in my opinion, k3s kcp scale down should still work without the etcd check & operations, but it may be fragile. i prefer to let this feature check in, since users have the choice to disable the MHC to avoid scale down if they want.

In the meantime, let's figure out how to deal with etcd. ref #75

@lukebond
Contributor Author

lukebond commented Dec 1, 2023

@zawachte do you think you'd have a chance to take a look at this one?

i'll probably hand-build this branch on monday and try it out in staging. i'll report back here if so


// Start remediating the unhealthy control plane machine by deleting it.
// A new machine will come up completing the operation as part of the regular reconcile.

Collaborator

@mogliang this is where your changes will come into play correct?

Collaborator

correct~

@zawachte
Collaborator

zawachte commented Dec 6, 2023

> @zawachte do you think you'd have a chance to take a look at this one?
>
> i'll probably hand-build this branch on monday and try it out in staging. i'll report back here if so

Changes mostly look good to me. Diffing against the kubeadm code, it seems that a bunch of the hairier code comes directly from there.

As you get at in the description, it's hard to judge correctness, and obviously building out the test infra for this repo is a big TODO.

I don't see any regression risks related to this PR so I think we can safely merge it and keep iterating on this feature.

@zawachte
Collaborator

zawachte commented Dec 6, 2023

I don't know how to communicate it to users of this repo, but I do want folks to be aware that control-plane remediation isn't fully supported with the merging of this PR. It's still a WIP.

@zawachte
Collaborator

zawachte commented Dec 6, 2023

@lukebond lmk if you are good to merge.

@lukebond
Contributor Author

lukebond commented Dec 7, 2023

i'm good to merge, thanks @zawachte! i didn't get around to running it in staging but i do plan to this week. i think it shouldn't block merge and release if it's correctly signposted as WIP. users need to explicitly set up the MHC anyway (roughly along the lines of the sketch below)
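
for anyone who lands here later: "set up the MHC" means creating a MachineHealthCheck that selects the CP machines. a rough sketch using the cluster-api v1beta1 go types might look like the following; the name, namespace and timeouts are placeholders, not anything from this PR:

```go
package example

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// controlPlaneMHC builds an illustrative MachineHealthCheck for the control
// plane machines of a cluster named "my-cluster"; all concrete values here
// are placeholders, not taken from this PR.
func controlPlaneMHC() *clusterv1.MachineHealthCheck {
	return &clusterv1.MachineHealthCheck{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "my-cluster-cp-unhealthy",
			Namespace: "default",
		},
		Spec: clusterv1.MachineHealthCheckSpec{
			ClusterName: "my-cluster",
			// Match only control plane machines via the standard CAPI label.
			Selector: metav1.LabelSelector{
				MatchLabels: map[string]string{
					"cluster.x-k8s.io/control-plane": "",
				},
			},
			// Consider a machine unhealthy if its node's Ready condition is
			// False or Unknown for longer than the timeout.
			UnhealthyConditions: []clusterv1.UnhealthyCondition{
				{Type: corev1.NodeReady, Status: corev1.ConditionFalse, Timeout: metav1.Duration{Duration: 5 * time.Minute}},
				{Type: corev1.NodeReady, Status: corev1.ConditionUnknown, Timeout: metav1.Duration{Duration: 5 * time.Minute}},
			},
		},
	}
}
```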

zawachte merged commit c85a596 into k3s-io:main Dec 7, 2023
@lukebond
Contributor Author

lukebond commented Dec 8, 2023

@zawachte i ran this in staging for a day without creating a CP MHC and it caused no problems. then i created a cluster with our controllers, wrote a MachineHealthCheck for the CP nodes of that new cluster, stopped one of the VMs in the AWS console and observed the CP node was replaced.

it works 🎉

for now we're running a hand-rolled build; we'd be happy to see a new release when you have the time! thanks for your time reviewing.

@lukebond
Contributor Author

lukebond commented Dec 8, 2023

cc @mogliang
