Fix: Enforce consistent Maintenance Mode Strategy behavior #7160
Conversation
The Pod deletion filter identifies the Pods belonging to VMs that need to be deleted during a node drain by looking at their maintenance-mode strategy label. The default value for this label is `Migrate`, which indicates that the Pod should be deleted during the node drain.

Pods with an invalid label, i.e. where the value of the maintenance-mode strategy label is not one of:

- Migrate
- ShutdownAndRestartAfterEnable
- ShutdownAndRestartAfterDisable
- Shutdown

are now also treated with the default behavior.

This fixes the problem that if a VM carries an invalid value in this label, nodes can become stuck in the `Cordoned` state when transitioning to maintenance mode: the node drain controller won't shut the VM down, but the VM is also not migrated away from the node. The VM therefore keeps running, preventing the node from completely transitioning into maintenance mode.

related-to: harvester#6835

Signed-off-by: Moritz Röhrich <[email protected]>
Add checks for the maintenance-mode strategy to the VM webhooks.

The admission webhooks for the VirtualMachine resource need to make sure that the maintenance-mode strategy for the VM is set to a sane value. There are two checks for this.

The first check is in the mutating webhook, which is executed first. It only makes sure that the label defining the maintenance-mode strategy is set. If the label is not set, the mutating webhook sets it to the default value `Migrate`. If the label is set, the mutating webhook does not modify it, even if it has an invalid value. This ensures that the maintenance-mode strategy is never unintentionally changed to a wrong value; instead, the second check rejects such a request with an error message, so the user can correct the value of the maintenance-mode strategy label. The mutating webhook also ensures that the maintenance-mode strategy label is copied from `.metadata.labels` to `.spec.template.metadata.labels`. This is necessary so that the Pod in which the virtual machine will run is labeled correctly.

The second check happens in the validating webhook. It ensures that the maintenance-mode strategy label is set to a valid maintenance-mode strategy, i.e. one of the values:

- Migrate
- ShutdownAndRestartAfterEnable
- ShutdownAndRestartAfterDisable
- Shutdown

This webhook denies a CREATE or UPDATE if the new VirtualMachine resource does not contain the maintenance-mode strategy label at all, or if it contains an invalid value. The only exception is an UPDATE to a VirtualMachine resource that already contains an invalid value in the maintenance-mode strategy label, where the value does not change. In this case the webhook accepts the request. This is crucial when the controller needs to deal with an existing VM that has an invalid value in this label (e.g. on a cluster that has been upgraded from an old version, before this label was checked by the admission webhook). The controller still needs to be able to perform UPDATE operations on such a resource to operate the VM.

Together, these two checks ensure that no VirtualMachine resource can be created with an invalid maintenance-mode strategy, or with no maintenance-mode strategy at all. They also make sure that the maintenance-mode strategy cannot be removed or changed to an invalid value on existing VirtualMachine resources.

related-to: harvester#6835

Signed-off-by: Moritz Röhrich <[email protected]>
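For illustration, a minimal sketch of that validation rule, operating on plain label maps rather than the real `*kubevirtv1.VirtualMachine` objects (the label key matches the one seen later in this thread; the function shape is an assumption, not the PR's actual code):

```go
import (
	"fmt"
	"slices"
)

const maintainModeStrategyLabel = "harvesterhci.io/maintain-mode-strategy"

var validStrategies = []string{
	"Migrate",
	"ShutdownAndRestartAfterEnable",
	"ShutdownAndRestartAfterDisable",
	"Shutdown",
}

// checkMaintenanceModeStrategy rejects a missing or invalid strategy label,
// except when an UPDATE leaves an already-invalid value unchanged, so that
// controllers can still operate legacy VMs. oldLabels is nil on CREATE.
func checkMaintenanceModeStrategy(newLabels, oldLabels map[string]string) error {
	newValue, ok := newLabels[maintainModeStrategyLabel]
	if !ok {
		return fmt.Errorf("label `%s` not found", maintainModeStrategyLabel)
	}
	if slices.Contains(validStrategies, newValue) {
		return nil
	}
	// UPDATE exception: tolerate an invalid value as long as it is unchanged.
	if oldLabels != nil && oldLabels[maintainModeStrategyLabel] == newValue {
		return nil
	}
	return fmt.Errorf("invalid maintenance-mode strategy %q", newValue)
}
```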
A question about the scenario: some VMs can't be migrated due to node selectors or dedicated devices, and sometimes a user doesn't want a VM to be migrated accidentally. The original design respected this and only processed a VM when a valid strategy was set, skipping any unset or invalid values.
In that case, the user should set one of the other maintenance-mode strategies. It is documented that VMs without a maintenance-mode strategy are migrated as well: https://docs.harvesterhci.io/v1.4/host/#node-maintenance Users can also manually shut down VMs prior to enabling maintenance mode if they want to prevent migration. I've reworked the label synchronization between VMI resources and Pods, so the maintenance-mode strategy can now be updated without restarting the VM as well.
Add a new VMI controller, which syncs all labels with the `harvesterhci.io` prefix to the Pod(s) corresponding to the VMI.

This new controller ensures that Harvester's label namespace is propagated all the way down to the Pod of a VirtualMachine. Changes to the labels of a VirtualMachine or VirtualMachineInstance do not justify re-creating the Pod, since labels are metadata. But labels like the maintenance-mode strategy label influence the behavior of VMs, and they may also be needed for features like load balancers to work properly. This controller synchronizes the labels from the VirtualMachineInstance to the Pod to make sure that any change on the upstream objects is reflected on the Pod. For the maintenance-mode strategy label specifically, this means that the maintenance-mode strategy of a VM can be updated without needing to restart the VM.

Signed-off-by: Moritz Röhrich <[email protected]>
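A rough sketch of the label selection this controller performs, assuming a simple prefix filter (the helper name is illustrative):

```go
import "strings"

// harvesterLabelPrefix is Harvester's label namespace.
const harvesterLabelPrefix = "harvesterhci.io/"

// harvesterLabels filters a VMI's labels down to the ones the controller
// should sync to the Pod.
func harvesterLabels(vmiLabels map[string]string) map[string]string {
	out := map[string]string{}
	for key, value := range vmiLabels {
		if strings.HasPrefix(key, harvesterLabelPrefix) {
			out[key] = value
		}
	}
	return out
}
```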
Please check those questions, thanks.
```go
// copy labels from VMI to Pod
for label := range harvesterVMILabels {
```
Below is a bit faster:

```go
for label, value := range harvesterVMILabels {
	pod.Labels[label] = value
}
```
```go
	pod.Labels[label] = harvesterVMILabels[label]
}

_, err := h.podClient.Update(pod)
```
The VMI onChange handler runs many times; using DeepEqual to check whether the Pod has really changed before calling Update will be faster.
DeepEqual runs in the local Go process, while `.Update` is a round trip to the API server, which is much slower.
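A minimal sketch of the suggested guard, assuming a Wrangler-style Pod client like the `h.podClient.Update` call in the diff above (the helper name is made up for illustration):

```go
import (
	"maps"

	corev1 "k8s.io/api/core/v1"
)

// syncLabelsIfChanged applies the desired labels to the Pod, but only pays
// for a DeepCopy and an API-server round trip when a label actually differs.
func (h *VMIController) syncLabelsIfChanged(pod *corev1.Pod, desired map[string]string) error {
	changed := false
	for label, value := range desired {
		if pod.Labels[label] != value {
			changed = true
			break
		}
	}
	if !changed {
		return nil // cheap local comparison, no Update call needed
	}
	podCopy := pod.DeepCopy()
	if podCopy.Labels == nil {
		podCopy.Labels = map[string]string{}
	}
	maps.Copy(podCopy.Labels, desired)
	_, err := h.podClient.Update(podCopy)
	return err
}
```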
I've made several changes that I think should improve the performance of this entire method. Avoiding the `Update` as much as possible is one of them, but the `DeepCopy` also now only happens when the `Update` happens.
Additionally, I've replaced a few of the for-loops with functions from the standard library, and I've gotten rid of the dynamically growing array.
The last big improvement in this department is that I'm now using a different Pod indexer, which only returns the Pods associated with the VirtualMachine rather than all Pods running on the node where the VirtualMachineInstance is running, so there are far fewer Pods to look through.
```go
	patchOps = append(patchOps, string(patch[:]))
}

// If the value under .spec.template.metadata.labels does not match the
```
We need to differentiate the `create` and `update` cases: for `create`, a wrong strategy should be rejected to give the user a clear error, so they notice and adjust it; for `update`, the value may come from a legacy VM, so adjusting the wrong value is user-friendly.
The mutator will not reject anything, since that's not its job. The mutator just makes sure that the label is set (or sets it to the default itself) and that the label in `.spec.template` matches the label in `.metadata`.
If the label is set, but to an invalid value, the mutator will not change it, because that would be unexpected for the user. However, the validator will later reject the operation if the label is invalid.
```diff
@@ -79,6 +79,11 @@ func (m *vmMutator) Create(_ *types.Request, newObj runtime.Object) (types.Patch
 		return nil, err
 	}

+	patchOps, err = m.ensureMaintenanceModeStrategyIsSet(vm, patchOps)
```
As in the comment below: in the `create` case, do not change the user's input silently.
In the `create` case, the user's input will never be changed. The mutator only adds the label if it doesn't exist.
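A condensed sketch of that rule (the real mutator emits JSON patch operations against the VirtualMachine; this simplified map-based form and the helper name are illustrative, reusing the label constant from the sketch earlier in this thread):

```go
// defaultMaintainModeStrategy is the documented default.
const defaultMaintainModeStrategy = "Migrate"

// ensureMaintenanceModeStrategyDefault adds the default strategy label only
// when it is absent; an existing value, even an invalid one, is left alone
// for the validating webhook to judge.
func ensureMaintenanceModeStrategyDefault(labels map[string]string) map[string]string {
	if labels == nil {
		labels = map[string]string{}
	}
	if _, ok := labels[maintainModeStrategyLabel]; !ok {
		labels[maintainModeStrategyLabel] = defaultMaintainModeStrategy
	}
	return labels
}
```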
```go
// SyncHarvesterVMILabelsToPod ensures that all Harvester labels (i.e. those
// with the harvesterhci.io/ prefix) from the VirtualMachineInstance are synced
// to the Pod
func (h *VMIController) SyncHarvesterVMILabelsToPod(_ string, vmi *kubevirtv1.VirtualMachineInstance) (*kubevirtv1.VirtualMachineInstance, error) {
```
This function may have a performance issue.
In which case is the strategy not synced to the Pod correctly? Could we handle that specific case instead of running this function on every VMI change? A VMI changes many times while a new VM is starting or migrating.
I haven't seen performance issues in my testing.
Regardless, I've applied the suggestions from the first round of review and made some more improvements to this method, all of which should help with performance, or at least robustness.
BTW, from 12/20 I will be offline and not reviewing PRs. Please move the PR forward if it meets approval; do not wait for me, thanks.
```go
logrus.Debugf("Syncing labels %v for VMI %v to Pod", vmi.Labels, vmi.Name)

harvesterVMILabels := map[string]string{}
```
We should skip the VMI if the maintenance strategy is `Migrate`. This will allow KubeVirt to handle the migration of VMs by itself, watching for nodes being placed into maintenance mode via its own watchers. We do not need to do anything specific for virt-launcher pods in this case.
The additional benefit is that we will not be patching the pod labels for every launcher pod. It is very likely that the maintenance strategy is only used by a subset of VMs among a large number of end-user clusters. This greatly reduces the number of reconciles / update requests to the API server for the pods.
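A sketch of the proposed short-circuit (the helper name is hypothetical):

```go
import kubevirtv1 "kubevirt.io/api/core/v1"

// skipLabelSync reports whether the controller can return early: with the
// default Migrate strategy, KubeVirt's own drain handling migrates the VM,
// so no virt-launcher Pod label patch would be required.
func skipLabelSync(vmi *kubevirtv1.VirtualMachineInstance) bool {
	return vmi.Labels["harvesterhci.io/maintain-mode-strategy"] == "Migrate"
}
```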
This controller is not just for the maintenance-mode functionality. It syncs all labels with the `harvesterhci.io` prefix, which may come in handy later on, e.g. when we need to match Pods to their Rancher downstream clusters, or when we need to label the Pods of node pools so load balancers can identify them.
I'm also not a huge fan of implementing this functionality in a controller, but I don't see many other options.
```diff
@@ -296,3 +307,41 @@ func (v *vmValidator) checkReservedMemoryAnnotation(vm *kubevirtv1.VirtualMachin
 func (v *vmValidator) checkStorageResourceQuota(vm *kubevirtv1.VirtualMachine, oldVM *kubevirtv1.VirtualMachine) error {
 	return v.rqCalculator.CheckStorageResourceQuota(vm, oldVM)
 }

+func checkMaintenanceModeStrategyIsValid(newVM, oldVM *kubevirtv1.VirtualMachine) error {
```
We should fine-tune this more, so that updates are rejected only when the migration strategy actually changes between oldVM and newVM and the new strategy is not a valid strategy.
An old VM from, say, 1.4.0, which does not have the strategy applied as a default, will not have a strategy at all. Any update on this VM will be rejected by the validator, which may have unexpected results.
For example, I deployed the harvester / harvester-webhook to an existing 1.4.0 cluster to mimic the upgrade, and post upgrade I still have VMs without the migration strategy set up.
As expected, hot-plug volume attachments to this VM are broken because the webhook is blocking the request:

```
node-1-dhcp:~ # virtctl addvolume test --volume-name=attachment --persist
You are using a client virtctl version that is different from the KubeVirt version running in the cluster
Client Version: v1.1.0
Server Version: v1.2.2
error adding volume, admission webhook "validator.harvesterhci.io" denied the request: label `harvesterhci.io/maintain-mode-strategy` not found
```
Weird... it seems like the mutating webhook didn't run, otherwise it would have added the label. But the validating webhook did run.
I'm not sure whether `virtctl addvolume` is an `update` operation or a patch.
Simplify control flow for the maintenance-mode strategy label validation in the VirtualMachine validating admission webhook.

Signed-off-by: Moritz Röhrich <[email protected]>
Force-pushed from cc1011b to e1d1d55.
Improve the SyncHarvesterVMILabelsToPod control loop:

- Make use of the `maps` module from the standard library for cloning, comparing, and copying. Also use the `slices` module from the standard library for similar operations on arrays.
- Slight performance improvement in for-loops over maps by avoiding double lookups of values.
- Use a different Pod indexer to reduce the number of Pods that need to be searched to find the Pods belonging to the VMI.
- Check that labels really need updating before committing to expensive operations like DeepCopy and Update on the Pods.
- Remove ad-hoc allocation and dynamic growing of an array.

Signed-off-by: Moritz Röhrich <[email protected]>
Problem:
The problems to be considered here are described in #6835.
Solution:
The solution to the problem consists of a modification to the node drain helper and modifications to both the mutating admission webhook and the validating admission webhook for the VirtualMachine resource.
The Node Drain Helper
The node drain helper was only marking Pods of VMs which had the maintenance-mode strategy label set to `Migrate` for deletion (i.e. migration off of the node). This behavior would omit all Pods which didn't have this label set at all, or which had it set to an invalid value. Similarly, Pods of VMs which didn't have one of the three other maintenance-mode strategies set would not be shut down. Thereby any VM without a maintenance-mode strategy, or with an invalid maintenance-mode strategy, would be left running, preventing the node from entering maintenance mode successfully. As a result, the node would remain stuck in the "Cordoned" state, waiting for an operator to intervene and deal with the VMs.
The new behavior of the drain helper is to migrate a VM off of the node if the associated Pod either has no maintenance-mode strategy label, has an invalid maintenance-mode strategy label, or has the maintenance-mode strategy specified as `Migrate`. This is the documented default behavior for VMs. All other VMs have a different maintenance-mode strategy specified, and they will thus be shut down (and later restarted according to their maintenance-mode strategy). See the sketch below.
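For illustration, that decision could be sketched roughly like this (helper name and signature are assumptions, reusing the label constant and strategy list from the sketch earlier in this thread):

```go
import "slices"

// shouldMigrateOnDrain reports whether the drain helper should delete the
// Pod so the VM is migrated off the node: a missing or unrecognized strategy
// label now falls back to the documented default, Migrate.
func shouldMigrateOnDrain(podLabels map[string]string) bool {
	strategy, ok := podLabels[maintainModeStrategyLabel]
	if !ok || !slices.Contains(validStrategies, strategy) {
		return true // no label or invalid label: apply the default behavior
	}
	return strategy == "Migrate"
}
```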
Admission Webhook
The admission webhook for the VirtualMachine resource plays a crucial role in making sure that VirtualMachines have a sane maintenance-mode strategy specified.
First, the mutating webhook ensures that the maintenance-mode strategy label is set on the VirtualMachine resource. If it isn't, it will be set to `Migrate`, which is the default. Then the mutating webhook ensures that the maintenance-mode strategy label set on the VirtualMachine resource is propagated into `.spec.template.metadata.labels`, which ensures that the label is propagated to both the VirtualMachineInstance and the Pod belonging to that VirtualMachine.
The mutating webhook will not modify existing maintenance-mode strategy labels on existing VirtualMachine resources. While the node drain helper will apply the default strategy to these virtual machines, this is to ensure that users will not be faced with inexplicably changing maintenance-mode strategy labels if they have an invalid value in their cluster.
The webhook will, however, log an appropriate message, allowing the user to identify VirtualMachine resources affected by invalid values in the maintenance-mode strategy label and correct them.
The validating webhook takes the role of a safeguard, ensuring that no VirtualMachine can be created or updated with an ill-specified maintenance-mode strategy.
It rejects the creation of VirtualMachine resources without a maintenance-mode strategy label or with an invalid value in the maintenance-mode strategy label.
This ensures that the label cannot accidentally be filled with bogus data and that the user is informed.
Related Issue:
harvester#6835
Test plan:
Scenario 1
Ensure that new VirtualMachines have a valid maintenance-mode strategy. If it's not given, the default should be applied.
Test 1
No maintenance-mode strategy given.
Setup:
Create a new VirtualMachine and specify no maintenance-mode strategy:
Expected Result
The default strategy `Migrate` should have been added by the mutating admission webhook:
Test 2
Valid maintenance-mode strategy given.
Setup:
Create a new VirtualMachine and specify a valid maintenance-mode strategy:
Expected Result
The previously specified maintenance-mode strategy should be persisted in the VirtualMachine resource's metadata:
Scenario 2
If an invalid maintenance-mode strategy is specified, the VirtualMachine resource should be rejected.
Test 1
Invalid maintenance-mode strategy given.
Setup:
Create a new VirtualMachine and specify an invalid maintenance-mode strategy:
Expected Result
The creation should be rejected with a sensible error message
Scenario 3
The specified maintenance-mode strategy should be propagated to all relevant resources.
Test 1
The maintenance-mode strategy is propagated to the VirtualMachineInstance and the corresponding Pod, when the VirtualMachine is created
Setup
Create a new VirtualMachine and specify a valid maintenance-mode strategy:
Expected Result
The VirtualMachine resource, the VirtualMachineInstance resource, and the corresponding Pod resource all have the specified maintenance-mode strategy label.
Test 2
The maintenance-mode strategy is updated on the VirtualMachineInstance and on the corresponding Pod, when the maintenance-mode strategy is updated on the VirtualMachine resource.
//TODO
Scenario 4
In case a VirtualMachine resource, VirtualMachineInstance resource, or corresponding Pod exists and either has no maintenance-mode strategy label or the label has an invalid value, the default behavior (`Migrate`) should be applied.
// TODO