
feat: first pass at MHC support #77

Merged 2 commits into k3s-io:main on Dec 7, 2023

Conversation

lukebond
Contributor

lukebond commented Nov 22, 2023

part of #63

i'm not convinced this is ready, but i have questions, so i'm using this PR to get them answered and move it towards readiness.

this PR takes the remediation code from the kubeadm controlplane provider and brings it into the k3s controlplane provider, with a few changes necessitated by the differences between the two and by things the k3s controlplane provider is missing due to being less mature. i'll try to outline those differences in this PR, as well as explain why i think this is more or less good to go, pending a few questions.

the cluster-api healthchecking guidelines

see docs.

  • Only Machines owned by a MachineSet or a KubeadmControlPlane can be remediated by a MachineHealthCheck (since a MachineDeployment uses a MachineSet, then this includes Machines that are part of a MachineDeployment) - this code works with machines owned by the k3s control plane, using the existing structures used elsewhere in the codebase that contain a list of controlplane machines ✅
  • Machines managed by a KubeadmControlPlane are remediated according to the delete-and-recreate guidelines described in the KubeadmControlPlane proposal - that's what the code i brought over from the kubeadm provider does ✅
  • in order to remediate a CP machine, either the cluster must not yet be initialised or it must have at least two CP machines. this is handled by the if controlPlane.KCP.Status.Initialized { block, which does some checks that can bail out of the remediation; those checks don't run if the CP isn't initialised, so remediation can continue (the sketch after this list shows how these checks fit together). as for the number of CP machines, i've brought across the canSafelyRemoveEtcdMember func (with some changes, discussed below), which does a quorum check. this dot point is hardly necessary since the quorum issue is dealt with elsewhere. ✅
  • Previous remediation (delete and re-create) MUST have been completed. This rule prevents KCP from remediating more machines while the replacement for the previous machine is not yet created. - this is handled in a few places: first, a check against an annotation indicating that a remediation is in progress but its replacement node hasn't come up yet; elsewhere there is code to allow a remediation to retry according to a configured policy. again, all taken from the kubeadm implementation. ✅
  • The cluster MUST have no machines with a deletion timestamp. This rule prevents KCP taking actions while the cluster is in a transitional state. - this is handled by a check inside the "if initialized" block to ensure there are no CP machines that are being deleted ✅
  • Remediation MUST preserve etcd quorum - this is satisfied by the quorum check described above. ✅
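
to make that gating concrete, here's a minimal sketch of the flow in go. the types and names (controlPlaneState, shouldRemediate) are made-up stand-ins, not the provider's actual structures, so read it as an illustration of the rules rather than the merged code:

```go
package remediation

// controlPlaneState is a hypothetical stand-in for the provider's view of the
// control plane; the real code reads these facts from the control plane object,
// its machines, and an in-progress-remediation annotation.
type controlPlaneState struct {
	Initialized            bool // the cluster has been initialised
	RemediationInProgress  bool // a previous delete-and-recreate hasn't completed yet
	HasDeletingMachines    bool // some CP machine already has a deletion timestamp
	QuorumSafeAfterRemoval bool // result of the etcd quorum check
}

// shouldRemediate applies the guidelines listed above: if the control plane
// isn't initialised yet, remediation proceeds without further checks;
// otherwise every rule must hold before the unhealthy machine is deleted
// (its replacement is created by the regular reconcile).
func shouldRemediate(cp controlPlaneState) bool {
	if !cp.Initialized {
		return true
	}
	if cp.RemediationInProgress {
		return false // previous remediation must finish first
	}
	if cp.HasDeletingMachines {
		return false // cluster is in a transitional state
	}
	return cp.QuorumSafeAfterRemoval
}
```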

limitations

mostly the code came across as-is and is largely copy-paste. i'm no open-source licensing expert, but my understanding is that it's fine to copy-paste this code from the kubeadm provider, especially since there is already plenty of such copy-pasted code, some still commented out, in the k3s provider codebase.

the main changes involved swapping the imports and data structures used, e.g. things like controlPlane.GetWorkloadCluster(ctx), which this provider doesn't have.

the main functional difference here, which is also a limitation and something i want to ask about, is in the canSafelyRemoveEtcdMember() function. the kubeadm version gets a list of etcd members by name, which it obtains by querying etcd, and stores that list in the workload cluster struct. the k3s provider doesn't store such a list and indeed sidesteps talking to etcd at all. there is a comment in workload_cluster.go that says // TODO: figure out etcd at the place where the kubeadm provider goes on to ensure that the node selected is not the etcd leader; meaning, i think, that if the k3s provider did a scale-down operation there is a chance it could choose the etcd leader. given we don't have that code, what i instead do in that function is assume that every CP node is an etcd node. i imagine that's fine for users today, but i'm not sure it's enough long-term.
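
to make that assumption explicit, here's a simplified sketch of the quorum arithmetic with every CP machine treated as an etcd member. the int-based signature is invented for illustration; the real function works on machine collections, not plain counts:

```go
package remediation

// canSafelyRemoveEtcdMember sketches the quorum check under the assumption
// that every control plane machine runs an etcd member. totalCPMachines is
// the current number of CP machines (including the one being remediated),
// unhealthyCPMachines the number currently unhealthy (also including it).
func canSafelyRemoveEtcdMember(totalCPMachines, unhealthyCPMachines int) bool {
	// After deleting the machine being remediated, one member is gone.
	targetTotalMembers := totalCPMachines - 1
	// etcd quorum for N members is floor(N/2) + 1.
	targetQuorum := (targetTotalMembers / 2) + 1
	// The machine being removed no longer counts among the unhealthy ones.
	targetUnhealthy := unhealthyCPMachines - 1
	if targetUnhealthy < 0 {
		targetUnhealthy = 0
	}
	// Remediation is safe only if the remaining healthy members still form a quorum.
	return targetTotalMembers-targetUnhealthy >= targetQuorum
}
```

with a 3-machine control plane and one unhealthy machine this returns true (2 of the 2 remaining members are healthy, quorum is 2); with 2 of 3 unhealthy it returns false.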

so, a question for the maintainers: do we want this addressed before we bring in CP node remediation? should i bring in an etcd client, properly fetch the list of members, and go on to fetch their machines, just like the kubeadm provider does?

tests

you have only the code and my word to go on, since i've not brought over the remediation tests from kubeadm. doing so as part of this PR would have been a big job: they aren't simply remediation unit tests, they use bits and pieces of testing helper and util code that i'd need to bring over as well, which felt like a bigger job than the actual remediation change itself.

some options i think we have here:

  • leave it as-is and assume that, since the code is tested in kubeadm, it should work
  • strip out a subset of tests from the kubeadm remediation code that are less dependent on the helper code, so at least we have something to start with, though coverage would likely be poor (something like the minimal sketch after this list)
  • do a separate testing PR to bring over the supporting bits and then retrospectively bring over the remediation tests, which would then be easier
  • just bring over all the testing support code used throughout the kubeadm provider in this PR and then bring over just the remediation tests, laying the groundwork for bringing over the rest later
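
as an example of the second option, a first test could be as small as a table-driven check of the quorum sketch above (assuming it lives in the same hypothetical remediation package), long before any of the kubeadm test helpers come across:

```go
package remediation

import "testing"

// TestCanSafelyRemoveEtcdMember exercises only the quorum arithmetic from the
// sketch above; it is a placeholder for the much broader kubeadm remediation tests.
func TestCanSafelyRemoveEtcdMember(t *testing.T) {
	cases := []struct {
		name      string
		total     int
		unhealthy int
		want      bool
	}{
		{"3 CP machines, 1 unhealthy", 3, 1, true},
		{"3 CP machines, 2 unhealthy", 3, 2, false},
		{"5 CP machines, 1 unhealthy", 5, 1, true},
		{"5 CP machines, 3 unhealthy", 5, 3, false},
	}
	for _, tc := range cases {
		if got := canSafelyRemoveEtcdMember(tc.total, tc.unhealthy); got != tc.want {
			t.Errorf("%s: got %v, want %v", tc.name, got, tc.want)
		}
	}
}
```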

on one hand it would seem harsh if this PR were held to a higher standard of testing than all the work done to date, but on the other hand this is controlplane node remediation we're talking about!

@mogliang
Collaborator

in my opinion, k3s kcp scale down should still work without the etcd check & operations, but it may be fragile. i prefer to let this feature check in, since users have the choice to disable the MHC to avoid scale down if they want.

In the meantime, let's figure out how to deal with etcd. ref #75

@lukebond
Contributor Author

lukebond commented Dec 1, 2023

@zawachte do you think you'd have a chance to take a look at this one?

i'll probably hand-build this branch on monday and try it out in staging. i'll report back here if so


// Start remediating the unhealthy control plane machine by deleting it.
// A new machine will come up completing the operation as part of the regular reconcile.

Collaborator

@mogliang this is where your changes will come into play correct?

Collaborator

correct~

@zawachte
Collaborator

zawachte commented Dec 6, 2023

> @zawachte do you think you'd have a chance to take a look at this one?
>
> i'll probably hand-build this branch on monday and try it out in staging. i'll report back here if so

Changes mostly look good to me. Diffing against the kubeadm code, it seems that a bunch of the hairier code comes directly from there.

As you get at in the description, it's hard to judge correctness, and obviously building out the test infra for this repo is a big TODO.

I don't see any regression risks related to this PR so I think we can safely merge it and keep iterating on this feature.

@zawachte
Collaborator

zawachte commented Dec 6, 2023

I don't know how to communicate it to users of this repo, but I do want folks to be aware that control-plane remediation isn't fully supported with the merging of this PR. It's still a WIP.

@zawachte
Collaborator

zawachte commented Dec 6, 2023

@lukebond lmk if you are good to merge.

@lukebond
Contributor Author

lukebond commented Dec 7, 2023

i'm good to merge, thanks @zawachte! i didn't get around to running it in staging but i do plan to this week. i think it shouldn't block merge and release if it's correctly signposted as WIP. users need to explicitly set up the MHC anyway (roughly along the lines of the sketch below)
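
for anyone who lands here later: "set up the MHC" means creating a MachineHealthCheck that selects the CP machines. a rough sketch using the cluster-api v1beta1 go types might look like the following; the name, namespace and timeouts are placeholders, not anything from this PR:

```go
package example

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// controlPlaneMHC builds an illustrative MachineHealthCheck for the control
// plane machines of a cluster named "my-cluster"; all concrete values here
// are placeholders, not taken from this PR.
func controlPlaneMHC() *clusterv1.MachineHealthCheck {
	return &clusterv1.MachineHealthCheck{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "my-cluster-cp-unhealthy",
			Namespace: "default",
		},
		Spec: clusterv1.MachineHealthCheckSpec{
			ClusterName: "my-cluster",
			// Match only control plane machines via the standard CAPI label.
			Selector: metav1.LabelSelector{
				MatchLabels: map[string]string{
					"cluster.x-k8s.io/control-plane": "",
				},
			},
			// Consider a machine unhealthy if its node's Ready condition is
			// False or Unknown for longer than the timeout.
			UnhealthyConditions: []clusterv1.UnhealthyCondition{
				{Type: corev1.NodeReady, Status: corev1.ConditionFalse, Timeout: metav1.Duration{Duration: 5 * time.Minute}},
				{Type: corev1.NodeReady, Status: corev1.ConditionUnknown, Timeout: metav1.Duration{Duration: 5 * time.Minute}},
			},
		},
	}
}
```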

zawachte merged commit c85a596 into k3s-io:main Dec 7, 2023
@lukebond
Contributor Author

lukebond commented Dec 8, 2023

@zawachte i ran this in staging for a day without creating a CP MHC and it caused no problems. then i created a cluster with our controllers, wrote a MachineHealthCheck for the CP nodes of that new cluster, stopped one of the VMs in the AWS console and observed the CP node was replaced.

it works 🎉

for now we're running a hand-rolled build; we'd be happy to see a new release when you have the time! thanks for your time reviewing.

@lukebond
Contributor Author

lukebond commented Dec 8, 2023

cc @mogliang
