Fix: Enforce consistent Maintenance Mode Strategy behavior #7160

Open: wants to merge 5 commits into master
Conversation

@m-ildefons (Contributor) commented Dec 13, 2024

Problem:

As described in #6835, there are several problems to be considered here:

  1. Ensuring that the maintenance-mode strategy is specified
  2. Ensuring that only a valid maintenance-mode strategy can be specified
  3. Ensuring that the specified maintenance-mode strategy label is propagated all the way to all resources where it's expected
  4. Ensuring that default behavior is applied in all cases where a maintenance-mode strategy either isn't specified or isn't valid

Solution:

The solution consists of a modification to the node drain helper and modifications to both the mutating admission webhook and the validating admission webhook for the VirtualMachine resource.

The Node Drain Helper

The node drain helper previously only marked Pods of VMs for deletion (i.e. migration off the node) when the maintenance-mode strategy label was set to Migrate.
This omitted all Pods that didn't have the label at all, as well as those with an invalid value. Similarly, Pods of VMs without one of the three other maintenance-mode strategies were not shut down. As a result, any VM without a maintenance-mode strategy, or with an invalid one, was left running, preventing the node from entering maintenance mode successfully. The node would remain stuck in the "Cordoned" state, waiting for an operator to intervene and deal with the VMs.
The drain helper now migrates a VM off the node if the associated Pod has no maintenance-mode strategy label, has an invalid value in the label, or has the strategy explicitly set to Migrate. This is the documented default behavior for VMs.
All other VMs have one of the remaining valid maintenance-mode strategies specified and are shut down (and later restarted according to their maintenance-mode strategy).

Admission Webhook

The admission webhooks for the VirtualMachine resource play a crucial role in making sure that VirtualMachines have a sane maintenance-mode strategy specified.
First, the mutating webhook ensures that the maintenance-mode strategy label is set on the VirtualMachine resource. If it isn't, it is set to the default, Migrate.
The mutating webhook then propagates the label into .spec.template.metadata.labels, which ensures that it reaches both the VirtualMachineInstance and the Pod belonging to that VirtualMachine.
The mutating webhook does not modify existing maintenance-mode strategy labels on existing VirtualMachine resources. While the node drain helper applies the default strategy to these virtual machines, leaving the label untouched ensures that users are not faced with inexplicably changing maintenance-mode strategy labels if an invalid value already exists in their cluster.
The webhook does, however, log an appropriate message, allowing the user to identify VirtualMachine resources affected by invalid values in the maintenance-mode strategy label and correct them.

The validating webhook acts as a safeguard, ensuring that no VirtualMachine can be created or updated with an ill-specified maintenance-mode strategy.
It rejects the creation of VirtualMachine resources that lack the maintenance-mode strategy label or carry an invalid value in it.
This ensures that the label cannot accidentally be filled with bogus data and that the user is informed.

Related Issue: #6835

Test plan:

Scenario 1

Ensure that new VirtualMachines have a valid maintenance-mode strategy. If it's not given, the default should be applied.

Test 1

No maintenance-mode strategy given.

Setup:

Create a new VirtualMachine and specify no maintenance-mode strategy:

type: kubevirt.io.virtualmachine
metadata:
  namespace: default
  labels:
    harvesterhci.io/creator: harvester
    harvesterhci.io/os: linux
  name: test
[...]

Expected Result

The default strategy Migrate should have been added by the mutating admission webhook:

type: kubevirt.io.virtualmachine
metadata:
  namespace: default
  labels:
    harvesterhci.io/maintain-mode-strategy: Migrate
    [...]
  name: test
[...]

Test 2

Valid maintenance-mode strategy given.

Setup:

Create a new VirtualMachine and specify a valid maintenance-mode strategy:

type: kubevirt.io.virtualmachine
metadata:
  namespace: default
  labels:
    harvesterhci.io/maintain-mode-strategy: Shutdown
    [...]
  name: test
[...]

Expected Result

The previously specified maintenance-mode strategy should be persisted in the VirtualMachine resource's metadata:

type: kubevirt.io.virtualmachine
metadata:
  namespace: default
  labels:
    harvesterhci.io/maintain-mode-strategy: Shutdown
    [...]
  name: test
[...]

Scenario 2

If an invalid maintenance-mode strategy is specified, the VirtualMachine resource should be rejected.

Test 1

Invalid maintenance-mode strategy given.

Setup:

Create a new VirtualMachine and specify an invalid maintenance-mode strategy:

type: kubevirt.io.virtualmachine
metadata:
  namespace: default
  labels:
    harvesterhci.io/maintain-mode-strategy: foobar
    [...]
  name: test
[...]

Expected Result

The creation should be rejected with a sensible error message:

The request is invalid: metadata.labels[harvesterhci.io/maintain-mode-strategy]: invalid maintenance mode strategy: foobar

Scenario 3

The specified maintenance-mode strategy should be propagated to all relevant resources.

Test 1

The maintenance-mode strategy is propagated to the VirtualMachineInstance and the corresponding Pod, when the VirtualMachine is created

Setup

Create a new VirtualMachine and specify a valid maintenance-mode strategy:

type: kubevirt.io.virtualmachine
metadata:
  namespace: default
  labels:
    harvesterhci.io/maintain-mode-strategy: Shutdown
    [...]
  name: test
[...]

Expected Result

The VirtualMachine resource, the VirtualMachineInstance resource and the belonging Pod resource all have the specified maintenance-mode strategy label.

$ kubectl get vm,vmi,pod -l harvesterhci.io/maintain-mode-strategy=Shutdown
NAME                              AGE   STATUS    READY
virtualmachine.kubevirt.io/test   10m   Running   True

NAME                                      AGE   PHASE     IP           NODENAME           READY
virtualmachineinstance.kubevirt.io/test   10m   Running   10.52.1.65   harvester-node-1   True

NAME                           READY   STATUS    RESTARTS   AGE
pod/virt-launcher-test-n5tf2   2/2     Running   0          10m

Test 2

The maintenance-mode strategy is updated on the VirtualMachineInstance and on the corresponding Pod, when the maintenance-mode strategy is updated on the VirtualMachine resource.

//TODO

Scenario 4

In case a VirtualMachine resource, VirtualMachineInstance resource, or a corresponding Pod exists and either has no maintenance-mode strategy label, or the maintenance-mode strategy label has an invalid value, the default behavior (Migrate) should be applied.

// TODO

The Pod deletion filter identifies the Pods belonging to VMs, which need
to be deleted during a node drain by looking at their maintenance-mode
strategy label. The default value for this label is `Migrate`, which
indicates that the Pod should be deleted during the node drain.
Pods with an invalid label, i.e. where the value of the maintenance-mode
strategy label is not one of:
  - Migrate
  - ShutdownAndRestartAfterEnable
  - ShutdownAndRestartAfterDisable
  - Shutdown
are now also being treated with the default behavior.
This fixes the problem that if a VM contains an invalid value in this
label, nodes can become stuck in `Cordoned` state when transitioning to
maintenance mode, as the node drain controller won't shut down the VM,
but it's also not migrated away from the node. Therefore the VM keeps
running, preventing the node from completely transitioning into
maintenance mode.

related-to: harvester#6835

Signed-off-by: Moritz Röhrich <[email protected]>
Add checks for maintenance-mode strategy to VM webhooks.

The admission webhooks for the VirtualMachine resource need to make
sure that the maintenance-mode strategy for the VM is set to a sane
value.
To do this, there are two checks:

The first one is in the mutating webhook, which is executed first, and
it just makes sure that the label, which defines the maintenance-mode
strategy, is set. If the label is not set, the mutating webhook will set
it with the default value of `Migrate`. If the label is set, the
mutating webhook will not modify it, even if it has an invalid value.
This ensures that the maintenance-mode strategy is never set
unintentionally to a wrong value.
The second check will ensure that in this case the request is rejected
with an error message, so the user can correct the value of the
maintenance-mode strategy label.
The mutating webhook will also ensure that the maintenance-mode strategy
label is copied from the `.metadata.labels` to
`.spec.template.metadata.labels`. This is necessary to ensure that the
Pod in which the virtual machine will run will be labeled correctly.

The second check happens in the validating webhook. This check ensures
that the maintenance-mode strategy label is set to a valid
maintenance-mode strategy, i.e. one of the values:
  - Migrate
  - ShutdownAndRestartAfterEnable
  - ShutdownAndRestartAfterDisable
  - Shutdown
This webhook will deny a CREATE or UPDATE, if the new VirtualMachine
resource does not contain the maintenance-mode strategy label at all, or
if it contains an invalid value.
The only exception is an UPDATE to a VirtualMachine resource that
already contains an invalid value in the maintenance-mode strategy label
and where the value does not change. In this case, the webhook will
accept the request. This is crucial, in case the controller needs to
deal with an existing VM that has an invalid value in this label (e.g.
on a cluster that has been upgraded from an old version, before this
label was checked by the admission webhook). In this case, the
controller still needs to be able to perform UPDATE operations on the
resource, to operate the VM.

Together, these two checks ensure that no VirtualMachine resource can be
created with an invalid maintenance-mode strategy, or with no
maintenance-mode strategy at all.
They also make sure that the maintenance-mode strategy can not be
removed or changed to an invalid value for existing VirtualMachine
resources.

related-to: harvester#6835

Signed-off-by: Moritz Röhrich <[email protected]>
@m-ildefons m-ildefons added area/admission-webhook Kubernetes validation and mutating admission webhooks area/node-maintenance labels Dec 13, 2024
@m-ildefons m-ildefons added this to the v1.5.0 milestone Dec 13, 2024
@m-ildefons m-ildefons requested review from bk201 and albinsun December 13, 2024 16:18
@m-ildefons m-ildefons self-assigned this Dec 13, 2024
@bk201 bk201 requested review from ibrokethecloud and w13915984028 and removed request for albinsun December 19, 2024 07:57
@w13915984028 (Member) commented:
A question about the scenario:

Some VMs can't be migrated due to selectors or dedicated devices. And sometimes a user doesn't want a VM to be migrated accidentally.

The original design respected this: it only processed VMs with a valid strategy set and skipped any unset or invalid values.

@m-ildefons m-ildefons requested a review from bk201 December 19, 2024 14:32
@m-ildefons (Contributor, Author) replied:

> A question about the scenario:
>
> Some VMs can't be migrated due to selectors or dedicated devices. And sometimes a user doesn't want a VM to be migrated accidentally.
>
> The original design respected this: it only processed VMs with a valid strategy set and skipped any unset or invalid values.

In that case, the user should set one of the other maintenance-mode strategies. It is documented that VMs without a maintenance-mode strategy are migrated as well: https://docs.harvesterhci.io/v1.4/host/#node-maintenance
With the maintenance-mode label ShutdownAndRestartAfterDisable, they can even have the VMs be started again automatically, once the node is no longer in maintenance mode.

Users can also manually shut down VMs prior to enabling the maintenance mode, if they want to prevent migration.

I've reworked the label synchronization between VMI resources and Pods, so the maintenance-mode strategy can now be updated without restarting the VM.

Add a new VMI controller, which syncs all labels with the
`harvesterhci.io` prefix to the Pod(s) corresponding to the VMI.
This new controller ensures that Harvester's label namespace is
propagated all the way down to the Pod of a VirtualMachine.
Changes to the labels of a VirtualMachine or VirtualMachineInstance do
not justify re-creating the Pod, since labels are metadata. But labels
like the maintenance-mode strategy label have influence on the behavior
of VMs, and they may also be needed for features like load balancers to
work properly.
This controller synchronizes the labels from the VirtualMachineInstance
to the Pod to make sure that any change on the upstream objects is
reflected on the Pod.
For the maintenance-mode strategy label specifically, this means that
the maintenance-mode strategy of a VM can be updated without needing to
restart the VM.

Signed-off-by: Moritz Röhrich <[email protected]>
@w13915984028 (Member) left a comment:

Please check those questions, thanks.

pkg/webhook/resources/virtualmachine/validator.go (comment outdated, resolved)
// copy labels from VMI to Pod
for label := range harvesterVMILabels {
	pod.Labels[label] = harvesterVMILabels[label]
}

Review comment (Member): the following is a bit faster, as it avoids the second map lookup:

for label, value := range harvesterVMILabels {
	pod.Labels[label] = value
}

_, err := h.podClient.Update(pod)

Review comment (Member): vmi onChange happens many times; use DeepEqual to check whether the Pod has really changed before calling Update. DeepEqual runs in the local Go thread, while .Update goes to the apiserver, which is much slower.

Reply (@m-ildefons, Contributor Author):

I've made several changes that should improve the performance of this entire method.
Avoiding the Update as much as possible is one of them, but the DeepCopy now also only happens when the Update happens.
Additionally, I've replaced a few of the for-loops with functions from the standard library and removed the dynamically growing array.
The last big improvement in this department is that I'm now using a different Pod indexer, which only returns Pods associated with the VirtualMachine rather than all Pods on the node where the VirtualMachineInstance is running, so there are far fewer Pods to look through.

patchOps = append(patchOps, string(patch[:]))
}

// If the value under .spec.template.metadata.labels does not match the
Review comment (Member):

We need to differentiate the create and update cases:

For create, a wrong strategy should be rejected with a clear error, so the user notices and adjusts it.

For update, the value may come from a legacy VM; adjusting the wrong value is more user-friendly.

Reply (@m-ildefons, Contributor Author):

The mutator will not reject anything, since it's not its job.

The mutator will just make sure that the label is set (or it will set it itself to the default) and that the label in the .spec.template matches the label in .metadata.

If the label is set but to an invalid value, the mutator will not change it, because that would be unexpected for the user.
However, the validator will later reject the operation if the label is invalid.

@@ -79,6 +79,11 @@ func (m *vmMutator) Create(_ *types.Request, newObj runtime.Object) (types.Patch
return nil, err
}

patchOps, err = m.ensureMaintenanceModeStrategyIsSet(vm, patchOps)
Review comment (Member):

As in the comment below: in the create case, do not change the user's input silently.

Reply (@m-ildefons, Contributor Author):

In the create case, the user's input is never changed. The mutator only adds the label if it doesn't exist.

// SyncHarvesterVMILabelsToPod ensures that all Harvester labels (i.e. those
// with the harvesterhci.io/ prefix) from the VirtualMachineInstance are synced
// to the Pod
func (h *VMIController) SyncHarvesterVMILabelsToPod(_ string, vmi *kubevirtv1.VirtualMachineInstance) (*kubevirtv1.VirtualMachineInstance, error) {
Review comment (Member):

This function may have a performance issue.

In which case is the strategy not synced to the Pod correctly? Could we handle that specific case instead of running this function on every VMI change? The VMI changes many times when a new VM is starting or migrating.

Reply (@m-ildefons, Contributor Author):

I haven't seen performance issues in my testing.
Regardless, I've applied the suggestions from the first round of review and made some more improvements to this method, which all should help with performance or at least robustness.

@w13915984028 (Member) commented:
By the way, from Dec 20 I will be offline and not reviewing PRs.

Please move the PR forward if it meets approval; do not wait for me, thanks.


logrus.Debugf("Syncing labels %v for VMI %v to Pod", vmi.Labels, vmi.Name)

harvesterVMILabels := map[string]string{}
Review comment (Contributor):

We should skip the VMI if the maintenance strategy is Migrate. This allows KubeVirt to handle the migration of VMs by watching for nodes being placed into maintenance mode via its own watchers.
We do not need to do anything specific for virt-launcher Pods in this case.

The additional benefit is that we will not be patching the Pod labels for every launcher Pod. It is very likely that the maintenance strategy is only used by a subset of VMs among a large number of end-user clusters. This greatly reduces the number of reconciles and update requests to the apiserver for the Pods.

Reply (@m-ildefons, Contributor Author):

This controller is not just for the maintenance-mode functionality. It syncs all labels with the harvesterhci.io prefix, which may come in handy later on, e.g. when we need to match Pods to their Rancher downstream clusters, or when we need to label Pods of node-pools, so load-balancers can identify them.

I'm also not a huge fan of implementing this functionality in a controller, but I don't see many other options.

@@ -296,3 +307,41 @@ func (v *vmValidator) checkReservedMemoryAnnotation(vm *kubevirtv1.VirtualMachin
func (v *vmValidator) checkStorageResourceQuota(vm *kubevirtv1.VirtualMachine, oldVM *kubevirtv1.VirtualMachine) error {
return v.rqCalculator.CheckStorageResourceQuota(vm, oldVM)
}

func checkMaintenanceModeStrategyIsValid(newVM, oldVM *kubevirtv1.VirtualMachine) error {
Review comment (Contributor):

We should fine-tune this more, to only ignore updates where the migration strategy is unchanged between oldVM and newVM, even when the value is not a valid strategy.

An old VM from, say, 1.4.0, which did not get the strategy applied as a default, will not have a strategy at all. Any update on such a VM will be rejected by the validator, which may have unexpected results.

For example, I deployed the harvester / harvester-webhook to an existing 1.4.0 cluster to mimic the upgrade, and post-upgrade I still have VMs without the migration strategy set up.

As expected, hot-plug volume attachments to this VM are broken because the webhook is blocking the request:

node-1-dhcp:~ # virtctl addvolume test --volume-name=attachment --persist
You are using a client virtctl version that is different from the KubeVirt version running in the cluster
Client Version: v1.1.0
Server Version: v1.2.2
error adding volume, admission webhook "validator.harvesterhci.io" denied the request: label `harvesterhci.io/maintain-mode-strategy` not found

Reply (@m-ildefons, Contributor Author):

Weird... it seems the mutating webhook didn't run, otherwise it would have added the label. But the validating webhook did run.
I'm not sure whether virtctl addvolume is an update operation or a patch.

Simplify control flow for the maintenance-mode strategy label validation
in the VirtualMachine validating admission webhook.

Signed-off-by: Moritz Röhrich <[email protected]>
@m-ildefons m-ildefons force-pushed the fix-6835 branch 2 times, most recently from cc1011b to e1d1d55 Compare December 20, 2024 15:54
Improve SyncHarvesterVMILabelsToPod control loop:

- Make use of the `maps` module from the standard library for cloning,
  comparing, copying. Also use `slices` module from the standard library
  for similar operations on arrays.
- Slight performance improvement in for loops over maps by avoiding
  double lookups of values
- Use a different Pod indexer to reduce the number of Pods that need to be
  searched to find the Pods belonging to the VMI
- Check that labels really need updating before committing to expensive
  operations like DeepCopy and Update on the Pods
- Remove ad-hoc allocation and dynamic growing of array

Signed-off-by: Moritz Röhrich <[email protected]>