feat: first pass at MHC support #77
Conversation
in my opinion, k3s kcp scale down should still work without the etcd check & operation, but it may be fragile. i prefer to let this feature check in, since users have the choice to disable MHC to avoid scale down if they want. in the meanwhile, let's figure out how to deal with etcd. ref #75
@zawachte do you think you'd have a chance to take a look at this one? i'll probably hand-build this branch on monday and try it out in staging. i'll report back here if so
// Start remediating the unhealthy control plane machine by deleting it.
// A new machine will come up completing the operation as part of the regular reconcile.
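the quoted comment is the core of the remediation path: delete the unhealthy control plane Machine and let the regular reconcile bring up a replacement. a minimal sketch of that step, using assumed names and the cluster-api v1beta1 / controller-runtime types rather than the actual code in this PR:

```go
package remediation

import (
	"context"
	"fmt"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// remediateByDeletion is a hedged sketch, not this PR's code: delete the unhealthy
// control plane Machine; the controller's next reconcile then sees fewer replicas
// than desired and creates a replacement machine.
func remediateByDeletion(ctx context.Context, c client.Client, unhealthy *clusterv1.Machine) error {
	if err := c.Delete(ctx, unhealthy); err != nil {
		return fmt.Errorf("failed to delete unhealthy machine %s: %w", unhealthy.Name, err)
	}
	return nil
}
```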
@mogliang this is where your changes will come into play, correct?
correct~
Changes mostly look good to me. Diffing against the kubeadm code, it seems that a bunch of the hairier code comes directly from there. As you get at in the description, it's hard to judge correctness, and obviously building out the test infra for this repo is a big TODO. I don't see any regression risks related to this PR, so I think we can safely merge it and keep iterating on this feature.
I don't know how best to communicate it to users of this repo, but I do want folks to be aware that control-plane remediation isn't fully supported with the merging of this PR. It's still a WIP.
@lukebond lmk if you are good to merge.
i'm good to merge, thanks @zawachte! i didn't get around to running it in staging but i do plan to this week. i think it shouldn't block merge and release as long as it's correctly signposted as WIP. users need to explicitly set up the MHC anyway
@zawachte i ran this in staging for a day without creating a CP MHC and it caused no problems. then i created a cluster with our controllers, wrote a MachineHealthCheck for the CP nodes of that new cluster, stopped one of the VMs in the AWS console, and observed the CP node being replaced. it works 🎉 for now we're running a hand-rolled build, so we'd be happy for a new release when you have the time! thanks for your time reviewing.
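for anyone else trying this out, a control plane MachineHealthCheck is roughly of the following shape. this sketch uses the cluster-api v1beta1 Go types; the cluster name, namespace, and timeouts are illustrative assumptions, not what we ran in staging:

```go
package example

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// controlPlaneMHC builds a MachineHealthCheck targeting the control plane Machines of a
// cluster named "staging". an object like this must be created explicitly by the user;
// without it, no control plane remediation happens.
func controlPlaneMHC() *clusterv1.MachineHealthCheck {
	timeout := metav1.Duration{Duration: 5 * time.Minute}
	return &clusterv1.MachineHealthCheck{
		ObjectMeta: metav1.ObjectMeta{Name: "staging-cp-unhealthy", Namespace: "default"},
		Spec: clusterv1.MachineHealthCheckSpec{
			ClusterName: "staging",
			// the control-plane label is set on CP Machines by the control plane provider
			Selector: metav1.LabelSelector{
				MatchLabels: map[string]string{"cluster.x-k8s.io/control-plane": ""},
			},
			UnhealthyConditions: []clusterv1.UnhealthyCondition{
				{Type: "Ready", Status: "Unknown", Timeout: timeout},
				{Type: "Ready", Status: "False", Timeout: timeout},
			},
		},
	}
}
```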
cc @mogliang
part of #63
i'm not convinced this is ready, but i have questions, so i'm using this PR to get them answered and move it towards readiness.
this PR takes the remediation code from the kubeadm controlplane provider and brings it into the k3s controlplane provider, with a few changes necessitated by the differences between the two and by things the k3s controlplane provider is missing due to being less mature. i'll try to outline those differences in this PR, as well as explain why i think this is more or less good to go, pending a few questions.
the cluster-api healthchecking guidelines
see docs.
- the remediation code sits inside the `if controlPlane.KCP.Status.Initialized {` block, which does some checks to bail out of the remediation; those checks don't run if the CP isn't initialised, so remediation can continue. ✅
- as for the number of CP machines, i've brought across the `canSafelyRemoveEtcdMember` func (with some changes, discussed below), which does a quorum check. this dot point is hardly necessary since the quorum issue is dealt with elsewhere. ✅
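for illustration, the quorum arithmetic that check relies on is roughly the following. this is a simplified sketch, not the actual canSafelyRemoveEtcdMember implementation:

```go
package remediation

// canSafelyRemoveOneMember is a simplified sketch of the quorum arithmetic: removing one
// (unhealthy) etcd member is only safe if the healthy members that remain still reach
// quorum of the new, smaller membership.
func canSafelyRemoveOneMember(totalMembers, healthyMembers int) bool {
	targetTotal := totalMembers - 1   // membership size after removing the unhealthy member
	targetQuorum := targetTotal/2 + 1 // quorum = floor(n/2) + 1, e.g. 2 members -> 2, 4 -> 3
	// the member being removed is assumed to be the unhealthy one, so the healthy count is unchanged
	return healthyMembers >= targetQuorum
}
```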
limitations
mostly the code came across as-is, and is largely copy and paste. i'm no OSS licensing expert, but my understanding is that it's fine to copy-paste this code from the kubeadm provider, especially since there is already plenty of such copy-pasted code, some still commented out, in the k3s provider codebase.
the main changes involved swapping the imports and data structures used, e.g. things like `controlPlane.GetWorkloadCluster(ctx)`, which this provider doesn't have.

the main functional difference here, which is also a limitation and something i want to ask about, is in the `canSafelyRemoveEtcdMember()` function. the kubeadm version gets a list of etcd members by name, which it obtains by querying etcd and storing the members in the workload cluster struct. the k3s provider doesn't store such a list and indeed sidesteps talking to etcd at all. there is a comment in `workload_cluster.go` that says `// TODO: figure out etcd`, and it marks the place where the kubeadm provider goes on to ensure that the node selected is not the etcd leader. meaning, i think, that if the k3s provider did a scale down operation there is a chance it could choose the etcd leader. given we don't have that code, what i do instead in that function is assume that every CP node is an etcd node. i imagine that's fine for users today, but i'm not sure it's enough long-term.

so, a question for the maintainers: do we want this addressed before we bring in CP node remediation? should i bring in an etcd client, properly fetch the list of members, and go on to fetch their machines, just like the kubeadm provider does?
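to make that assumption concrete, it amounts to something like the sketch below (illustrative names, not the actual code in this PR): the etcd member list is derived from the control plane Machines themselves rather than from etcd.

```go
package remediation

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// assumedEtcdMembers is an illustrative sketch, not this PR's code: since the k3s provider
// does not talk to etcd, treat every control plane Machine as hosting exactly one embedded
// etcd member, named after its node, and feed that list to the quorum check.
func assumedEtcdMembers(controlPlaneMachines []*clusterv1.Machine) []string {
	members := make([]string, 0, len(controlPlaneMachines))
	for _, m := range controlPlaneMachines {
		if m.Status.NodeRef != nil {
			members = append(members, m.Status.NodeRef.Name)
		}
	}
	return members
}
```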
tests
you have only the code and my word to go on, since i've not brought over the remediation tests from kubeadm. doing so as part of this PR would have been a big job: they aren't simply remediation unit tests, they rely on bits and pieces of testing helper and util code that i'd need to bring over as well, and that felt bigger than the actual remediation change itself.
some options i think we have here:
on one hand it would seem harsh if this PR were held to a higher standard of testing than all the work done to date, but on the other hand this is controlplane node remediation we're talking about!